HDA Exam
Question 1: Descriptive Table
Build a descriptive table, comparing patients with the event death (within 10 years) versus patients without the event in that time frame. Insert also a column with the total population descriptive statistics. Comment about percentages of missing data in the candidate predictors.
1.1 Format variables
#explore the type of variables
str(survey)
## 'data.frame': 6989 obs. of 21 variables:
## $ Age : int 39 65 74 57 74 44 39 70 42 72 ...
## $ Diastolic.BP : int 90 85 110 82 80 95 90 95 75 80 ...
## $ Poverty.index : int 52 137 89 129 63 99 258 54 258 184 ...
## $ Race : int 2 2 2 2 2 2 2 2 1 1 ...
## $ Red.blood.cells : num 4.91 4.08 4.46 4.33 4.73 3.81 4.69 4.93 4.11 5.01 ...
## $ Sedimentation.rate: int 25 20 46 22 8 50 4 32 18 13 ...
## $ Serum.Albumin : num 4.9 5.1 4.3 4.4 4.4 4.6 4.4 4.2 5 4.5 ...
## $ Serum.Cholesterol : num 243 254 337 249 141 ...
## $ Serum.Iron : int 84 84 102 95 131 86 102 144 111 161 ...
## $ Serum.Magnesium : num 1.62 1.74 1.91 1.55 1.81 1.8 1.8 1.61 1.69 1.78 ...
## $ Serum.Protein : num 8.8 7.8 7.5 7.7 7.5 7.7 7 8.3 6.8 6.6 ...
## $ Sex : int 2 2 2 2 1 2 2 1 2 1 ...
## $ Systolic.BP : int 135 180 190 150 NA 150 120 NA 115 NA ...
## $ TIBC : int 423 363 287 362 315 382 401 352 468 367 ...
## $ TS : num 19.9 23.1 35.5 26.2 41.6 22.5 25.4 40.9 23.7 43.9 ...
## $ White.blood.cells : num 7.8 8.4 8.1 6.1 3.8 6.5 6.1 5.2 6 8.4 ...
## $ BMI : num 42 28.6 23 33.4 27.3 ...
## $ Pulse.pressure : int 45 95 80 68 62 55 30 105 40 70 ...
## $ time : num 20.9 21.1 12.8 20 20.9 ...
## $ status : int 0 0 1 1 0 0 0 1 0 1 ...
## $ death : int 0 0 0 0 0 0 0 1 0 0 ...
Some variables coded as int are actually categorical, so I recode them as factors.
#rename time variable to avoid conflicts
names(survey)[names(survey) == "time"] <- "time_death"
# recode categoric variable as factor
categoric_vars<-c('Race','Sex','status','death')
survey[,categoric_vars]<-lapply(survey[,categoric_vars], as.factor)
numeric_vars<- survey %>% dplyr::select(-(all_of(categoric_vars))) %>% colnames()
#assign labels in order to make easier the comparisons
survey$Race=factor(survey$Race,
labels = c("White",
"afro-american",
"other"))
survey$Sex = factor(survey$Sex, labels = c("male", "female"))
str(survey)
## 'data.frame': 6989 obs. of 21 variables:
## $ Age : int 39 65 74 57 74 44 39 70 42 72 ...
## $ Diastolic.BP : int 90 85 110 82 80 95 90 95 75 80 ...
## $ Poverty.index : int 52 137 89 129 63 99 258 54 258 184 ...
## $ Race : Factor w/ 3 levels "White","afro-american",..: 2 2 2 2 2 2 2 2 1 1 ...
## $ Red.blood.cells : num 4.91 4.08 4.46 4.33 4.73 3.81 4.69 4.93 4.11 5.01 ...
## $ Sedimentation.rate: int 25 20 46 22 8 50 4 32 18 13 ...
## $ Serum.Albumin : num 4.9 5.1 4.3 4.4 4.4 4.6 4.4 4.2 5 4.5 ...
## $ Serum.Cholesterol : num 243 254 337 249 141 ...
## $ Serum.Iron : int 84 84 102 95 131 86 102 144 111 161 ...
## $ Serum.Magnesium : num 1.62 1.74 1.91 1.55 1.81 1.8 1.8 1.61 1.69 1.78 ...
## $ Serum.Protein : num 8.8 7.8 7.5 7.7 7.5 7.7 7 8.3 6.8 6.6 ...
## $ Sex : Factor w/ 2 levels "male","female": 2 2 2 2 1 2 2 1 2 1 ...
## $ Systolic.BP : int 135 180 190 150 NA 150 120 NA 115 NA ...
## $ TIBC : int 423 363 287 362 315 382 401 352 468 367 ...
## $ TS : num 19.9 23.1 35.5 26.2 41.6 22.5 25.4 40.9 23.7 43.9 ...
## $ White.blood.cells : num 7.8 8.4 8.1 6.1 3.8 6.5 6.1 5.2 6 8.4 ...
## $ BMI : num 42 28.6 23 33.4 27.3 ...
## $ Pulse.pressure : int 45 95 80 68 62 55 30 105 40 70 ...
## $ time_death : num 20.9 21.1 12.8 20 20.9 ...
## $ status : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 2 ...
## $ death : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
1.2 Visualization
Visualizing the data can sometimes help us understand it better.
Categorical variables
# categorical variables
barR <- ggplot(data = survey, aes(x=Race, fill = "Race")) +
geom_bar(width = 0.6) + xlab("Race")+
ggtitle("Race") +
scale_fill_manual(values=c("Orange", "#bb44f0", "#45ADA8"))+theme_classic()
barS <- ggplot(data = survey, aes( x =status, fill = "status")) +
geom_bar(width = 0.6) +
xlab("Status") +ylab("Count")+
ggtitle("Status") +
scale_fill_manual(values=c("#bb44f0"))+theme_classic()
#stacked.barS
barSe <- ggplot(data = survey, aes( x =Sex, fill = "Sex")) +
geom_bar(width = 0.6) +
xlab("Sex") +ylab("Count")+
ggtitle("Sex") +
scale_fill_manual(values=c("#45ADA8")) + theme_classic()
#stacked.barSe
tot.p <- grid.arrange(barR,barS,barSe, ncol = 3)
Let’s see how the data are distributed across the levels of the variable death.
windowsFonts(A = windowsFont("Arvo"))
# categorical variables
# consider the relative frequency to have a better understanding
stacked.barR <- ggplot(data = survey, aes( x =Race, fill = death)) +
geom_bar(position = "fill") +
xlab("Race") +ylab("Relative Frequency")+
ggtitle("Deaths per Race") +
scale_fill_manual(values=c("Orange", "#bb44f0"))+theme_classic()
#stacked.barR
stacked.barS <- ggplot(data = survey, aes( x =status, fill = death)) +
geom_bar(position = "fill") +
xlab("Status") +ylab("Relative Frequency")+
ggtitle("Deaths per status") +
scale_fill_manual(values=c("Orange", "#bb44f0"))+theme_classic()
#stacked.barS
stacked.barSe <- ggplot(data = survey, aes( x =Sex, fill = death)) +
geom_bar(position = "fill") +
xlab("Sex") +ylab("Relative Frequency")+
ggtitle("Deaths per Sex") +
scale_fill_manual(values=c("Orange", "#bb44f0")) + theme_classic()
#stacked.barSe
tot.p <- grid.arrange(stacked.barR,stacked.barS, stacked.barSe, ncol = 3)
Continuous variables
Since some variables appear positively skewed, we apply a square-root or log transformation to bring their distributions closer to normal.
#transformation of Skewed variables
transform_sqrt<-c("Pulse.pressure", "Serum.Iron", "Poverty.index")
transform_log <-c("BMI", "Serum.Cholesterol","Red.blood.cells","White.blood.cells", "Sedimentation.rate")
survey_trans<-survey
survey_trans[,transform_log]<-lapply(survey[,transform_log], log)
survey_trans[,transform_sqrt]<-lapply(survey[,transform_sqrt], sqrt)
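As a cross-check of which transformation is appropriate, the skewness of each variable can be computed directly; a minimal base-R sketch (the cutoff values are illustrative rules of thumb, not taken from the original analysis):

```r
# Sample skewness: mean of the cubed standardized deviations
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean(((x - mean(x)) / sd(x))^3)
}

# Illustrative rule of thumb: log for strong right skew, sqrt for moderate skew
suggest_transform <- function(x) {
  g <- skewness(x)
  if (g > 1) "log" else if (g > 0.5) "sqrt" else "none"
}
# e.g. sapply(survey[, c(transform_log, transform_sqrt)], skewness)
```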
for (var in c(transform_log, transform_sqrt)){
hist.p<-gghistogram(survey, x = var, y = "..count..", add = "mean",bins = 20, fill = "orange", rug = TRUE, color = "orange", title = "Original")+theme_classic()
hist.t<-gghistogram(survey_trans, x = var, y = "..count..", add = "mean", bins = 20, fill = "orange",rug = TRUE, color = "Orange", title = "Transformed")+theme_classic()
tot.p <- grid.arrange(hist.p, hist.t, ncol = 2)
}
Let’s see how the data are distributed across the levels of the variable death.
for (var in numeric_vars){
hist.p<-gghistogram(survey, x = var, y = "..density..", add = "median", bins = 20, rug = TRUE,fill = "death", add_density = TRUE, palette = c("Orange", "#bb44f0"))
bx.p <- ggboxplot(survey, x="death", y=var, ylab = var, color ="death" ,palette = c("Orange", "#bb44f0"))
tot.p <- grid.arrange(bx.p,hist.p, ncol = 2)
}
1.3 Descriptive table
A first rough overview of the descriptive statistics is given by the function summary().
summary(survey)
## Age Diastolic.BP Poverty.index Race
## Min. :25.00 Min. : 34.00 Min. : 2.0 White :5751
## 1st Qu.:35.00 1st Qu.: 74.00 1st Qu.:137.0 afro-american:1158
## Median :49.00 Median : 82.00 Median :238.0 other : 80
## Mean :49.76 Mean : 83.32 Mean :289.5
## 3rd Qu.:66.00 3rd Qu.: 90.00 3rd Qu.:373.0
## Max. :74.00 Max. :180.00 Max. :999.0
##
## Red.blood.cells Sedimentation.rate Serum.Albumin Serum.Cholesterol
## Min. :2.550 Min. : 1.00 Min. :2.700 Min. : 53.0
## 1st Qu.:4.400 1st Qu.: 7.00 1st Qu.:4.200 1st Qu.:188.0
## Median :4.710 Median :13.00 Median :4.400 Median :218.0
## Mean :4.733 Mean :15.86 Mean :4.372 Mean :222.5
## 3rd Qu.:5.050 3rd Qu.:22.00 3rd Qu.:4.600 3rd Qu.:250.0
## Max. :8.880 Max. :65.00 Max. :5.600 Max. :793.0
##
## Serum.Iron Serum.Magnesium Serum.Protein Sex Systolic.BP
## Min. : 17.0 Min. :0.82 Min. : 4.4 male :2744 Min. : 80.0
## 1st Qu.: 75.0 1st Qu.:1.59 1st Qu.: 6.8 female:4245 1st Qu.:115.0
## Median : 96.0 Median :1.68 Median : 7.1 Median :126.0
## Mean :100.6 Mean :1.68 Mean : 7.1 Mean :130.7
## 3rd Qu.:121.0 3rd Qu.:1.77 3rd Qu.: 7.4 3rd Qu.:140.0
## Max. :396.0 Max. :2.70 Max. :11.5 Max. :260.0
## NA's :1389
## TIBC TS White.blood.cells BMI
## Min. :168.0 Min. : 3.20 Min. : 2.30 Min. :12.94
## 1st Qu.:323.0 1st Qu.: 21.00 1st Qu.: 6.00 1st Qu.:22.14
## Median :356.0 Median : 27.10 Median : 7.20 Median :24.93
## Mean :362.9 Mean : 28.36 Mean : 7.46 Mean :25.68
## 3rd Qu.:396.0 3rd Qu.: 34.30 3rd Qu.: 8.60 3rd Qu.:28.31
## Max. :691.0 Max. :100.00 Max. :56.00 Max. :72.22
##
## Pulse.pressure time_death status death
## Min. : 10.00 Min. : 0.01644 0:4488 0:5888
## 1st Qu.: 40.00 1st Qu.:13.84429 1:2501 1:1101
## Median : 48.00 Median :18.84977
## Mean : 51.45 Mean :16.28802
## 3rd Qu.: 60.00 3rd Qu.:20.01370
## Max. :150.00 Max. :21.81142
##
From this summary we can already see that only one variable has missing values: Systolic.BP.
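This can also be verified programmatically; a small sketch that reports, for each variable, the number and percentage of missing values:

```r
# Per-variable missingness report: keep only variables with at least one NA
na_report <- function(df) {
  n_missing <- colSums(is.na(df))
  out <- data.frame(n_missing = n_missing,
                    pct_missing = round(100 * n_missing / nrow(df), 1))
  out[out$n_missing > 0, , drop = FALSE]
}
# e.g. na_report(survey)  # should flag only Systolic.BP (about 20%)
```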
Let’s see the descriptive statistics in a fancier way, comparing the patients with the event and those without it, adding also a column with total population’s statistics.
# Descriptive statistics of patients'characteristics by outcome
survey %>%
dplyr::select(all_of(numeric_vars), all_of(categoric_vars)) %>%
tbl_summary(by = death,
type = all_continuous() ~ "continuous2",
statistic = all_continuous() ~ c(
"{mean} ({sd})",
"{median} ({p25}, {p75})", #median with 1th-3th percentiles
"{min}, {max}",
"{N_miss} ({p_miss} %)"),
missing = "no") %>%
add_overall() %>%
add_p() %>%
bold_labels()
| Characteristic | Overall, N = 6,989¹ | 0, N = 5,888 | 1, N = 1,101 | p-value² |
|---|---|---|---|---|
| Age | <0.001 | |||
| Mean (SD) | 50 (16) | 47 (15) | 64 (10) | |
| Median (IQR) | 49 (35, 66) | 44 (33, 62) | 68 (63, 71) | |
| Range | 25, 74 | 25, 74 | 25, 74 | |
| N missing (% missing %) | 0 (0 %) | 0 (0 %) | 0 (0 %) | |
| Diastolic.BP | <0.001 | |||
| Mean (SD) | 83 (13) | 83 (13) | 87 (15) | |
| Median (IQR) | 82 (74, 90) | 80 (74, 90) | 86 (78, 96) | |
| Range | 34, 180 | 34, 180 | 50, 144 | |
| N missing (% missing %) | 0 (0 %) | 0 (0 %) | 0 (0 %) | |
| Poverty.index | <0.001 | |||
| Mean (SD) | 289 (220) | 298 (218) | 246 (223) | |
| Median (IQR) | 238 (137, 373) | 248 (148, 377) | 172 (99, 312) | |
| Range | 2, 999 | 2, 999 | 4, 999 | |
| N missing (% missing %) | 0 (0 %) | 0 (0 %) | 0 (0 %) | |
| Red.blood.cells | 0.021 | |||
| Mean (SD) | 4.73 (0.49) | 4.73 (0.47) | 4.76 (0.55) | |
| Median (IQR) | 4.71 (4.40, 5.05) | 4.71 (4.40, 5.03) | 4.74 (4.38, 5.12) | |
| Range | 2.55, 8.88 | 2.55, 8.88 | 2.93, 6.56 | |
| N missing (% missing %) | 0 (0 %) | 0 (0 %) | 0 (0 %) | |
| Sedimentation.rate | <0.001 | |||
| Mean (SD) | 16 (11) | 15 (11) | 20 (14) | |
| Median (IQR) | 13 (7, 22) | 13 (7, 21) | 17 (9, 29) | |
| Range | 1, 65 | 1, 60 | 1, 65 | |
| N missing (% missing %) | 0 (0 %) | 0 (0 %) | 0 (0 %) | |
| Serum.Albumin | <0.001 | |||
| Mean (SD) | 4.37 (0.33) | 4.39 (0.32) | 4.27 (0.34) | |
| Median (IQR) | 4.40 (4.20, 4.60) | 4.40 (4.20, 4.60) | 4.30 (4.10, 4.50) | |
| Range | 2.70, 5.60 | 2.70, 5.60 | 2.70, 5.60 | |
| N missing (% missing %) | 0 (0 %) | 0 (0 %) | 0 (0 %) | |
| Serum.Cholesterol | <0.001 | |||
| Mean (SD) | 222 (50) | 221 (49) | 232 (53) | |
| Median (IQR) | 218 (188, 250) | 216 (187, 248) | 227 (196, 262) | |
| Range | 53, 793 | 53, 793 | 87, 691 | |
| N missing (% missing %) | 0 (0 %) | 0 (0 %) | 0 (0 %) | |
| Serum.Iron | <0.001 | |||
| Mean (SD) | 101 (37) | 101 (37) | 98 (37) | |
| Median (IQR) | 96 (75, 121) | 96 (76, 121) | 93 (72, 116) | |
| Range | 17, 396 | 17, 301 | 23, 396 | |
| N missing (% missing %) | 0 (0 %) | 0 (0 %) | 0 (0 %) | |
| Serum.Magnesium | 0.035 | |||
| Mean (SD) | 1.68 (0.14) | 1.68 (0.14) | 1.67 (0.16) | |
| Median (IQR) | 1.68 (1.59, 1.77) | 1.68 (1.59, 1.77) | 1.67 (1.58, 1.77) | |
| Range | 0.82, 2.70 | 0.82, 2.70 | 0.98, 2.49 | |
| N missing (% missing %) | 0 (0 %) | 0 (0 %) | 0 (0 %) | |
| Serum.Protein | 0.077 | |||
| Mean (SD) | 7.10 (0.50) | 7.09 (0.49) | 7.13 (0.54) | |
| Median (IQR) | 7.10 (6.80, 7.40) | 7.10 (6.80, 7.40) | 7.10 (6.80, 7.40) | |
| Range | 4.40, 11.50 | 4.40, 11.50 | 5.30, 10.20 | |
| N missing (% missing %) | 0 (0 %) | 0 (0 %) | 0 (0 %) | |
| Systolic.BP | <0.001 | |||
| Mean (SD) | 131 (23) | 129 (22) | 146 (28) | |
| Median (IQR) | 126 (115, 140) | 124 (114, 140) | 142 (125, 162) | |
| Range | 80, 260 | 80, 260 | 88, 240 | |
| N missing (% missing %) | 1,389 (20 %) | 879 (15 %) | 510 (46 %) | |
| TIBC | <0.001 | |||
| Mean (SD) | 363 (58) | 365 (58) | 350 (56) | |
| Median (IQR) | 356 (323, 396) | 358 (326, 397) | 345 (313, 382) | |
| Range | 168, 691 | 175, 691 | 168, 607 | |
| N missing (% missing %) | 0 (0 %) | 0 (0 %) | 0 (0 %) | |
| TS | 0.8 | |||
| Mean (SD) | 28 (11) | 28 (11) | 29 (11) | |
| Median (IQR) | 27 (21, 34) | 27 (21, 34) | 27 (21, 35) | |
| Range | 3, 100 | 3, 100 | 4, 100 | |
| N missing (% missing %) | 0 (0 %) | 0 (0 %) | 0 (0 %) | |
| White.blood.cells | 0.020 | |||
| Mean (SD) | 7.46 (2.29) | 7.42 (2.18) | 7.65 (2.80) | |
| Median (IQR) | 7.20 (6.00, 8.60) | 7.20 (6.00, 8.50) | 7.30 (6.00, 8.80) | |
| Range | 2.30, 56.00 | 2.30, 51.20 | 2.90, 56.00 | |
| N missing (% missing %) | 0 (0 %) | 0 (0 %) | 0 (0 %) | |
| BMI | 0.070 | |||
| Mean (SD) | 25.7 (5.1) | 25.6 (5.1) | 25.8 (5.3) | |
| Median (IQR) | 24.9 (22.1, 28.3) | 24.9 (22.1, 28.2) | 25.2 (22.2, 28.7) | |
| Range | 12.9, 72.2 | 14.2, 63.2 | 12.9, 72.2 | |
| N missing (% missing %) | 0 (0 %) | 0 (0 %) | 0 (0 %) | |
| Pulse.pressure | <0.001 | |||
| Mean (SD) | 51 (18) | 49 (17) | 63 (22) | |
| Median (IQR) | 48 (40, 60) | 46 (38, 58) | 60 (46, 76) | |
| Range | 10, 150 | 10, 150 | 12, 150 | |
| N missing (% missing %) | 0 (0 %) | 0 (0 %) | 0 (0 %) | |
| time_death | <0.001 | |||
| Mean (SD) | 16.3 (5.5) | 18.3 (2.7) | 5.4 (2.8) | |
| Median (IQR) | 18.8 (13.8, 20.0) | 19.1 (18.1, 20.2) | 5.5 (3.1, 7.9) | |
| Range | 0.0, 21.8 | 10.0, 21.8 | 0.0, 10.0 | |
| N missing (% missing %) | 0 (0 %) | 0 (0 %) | 0 (0 %) | |
| Race | <0.001 | |||
| White | 5,751 (82%) | 4,899 (83%) | 852 (77%) | |
| afro-american | 1,158 (17%) | 925 (16%) | 233 (21%) | |
| other | 80 (1.1%) | 64 (1.1%) | 16 (1.5%) | |
| Sex | <0.001 | |||
| male | 2,744 (39%) | 2,090 (35%) | 654 (59%) | |
| female | 4,245 (61%) | 3,798 (65%) | 447 (41%) | |
| status | <0.001 | |||
| 0 | 4,488 (64%) | 4,488 (76%) | 0 (0%) | |
| 1 | 2,501 (36%) | 1,400 (24%) | 1,101 (100%) | |
¹ Mean (SD); Median (IQR); Range; N missing (% missing); n (%)
² Wilcoxon rank sum test; Pearson's Chi-squared test
As noted, only the variable Systolic.BP, the systolic blood pressure, has missing values. Its missingness rate, 20%, is well above the 5% level usually considered negligible, so these missing values cannot simply be ignored and must be handled explicitly.
Question 2: Missing Values
You will probably note that one variable has a non-negligible rate of missing values. Do you think that the missing values are completely at random, at random or not at random?
Let’s look at some descriptive statistics comparing the rows with missing and non-missing Systolic.BP.
#Adding a categorical variable indicating missing and non missing Systolic.BP values for each row
survey$systolic_test <- factor(ifelse(is.na(survey$Systolic.BP),"missing","non-missing"))
# Descriptive statistics of patients'characteristics by outcome
df = subset(survey, select=-Systolic.BP)
tbl_summary(df, by = systolic_test,
type = all_continuous() ~ "continuous2",
statistic = all_continuous() ~ c(
"{mean} ({sd})",
"{median} ({p25}, {p75})", #median with percentile
"{min}, {max}"),
missing = "no") %>%
add_overall() %>%
bold_labels()
| Characteristic | Overall, N = 6,989¹ | missing, N = 1,389 | non-missing, N = 5,600 |
|---|---|---|---|
| Age | |||
| Mean (SD) | 50 (16) | 69 (4) | 45 (14) |
| Median (IQR) | 49 (35, 66) | 69 (66, 72) | 43 (33, 56) |
| Range | 25, 74 | 39, 74 | 25, 74 |
| Diastolic.BP | |||
| Mean (SD) | 83 (13) | 87 (13) | 83 (13) |
| Median (IQR) | 82 (74, 90) | 86 (80, 94) | 80 (74, 90) |
| Range | 34, 180 | 50, 150 | 34, 180 |
| Poverty.index | |||
| Mean (SD) | 289 (220) | 266 (240) | 295 (214) |
| Median (IQR) | 238 (137, 373) | 190 (106, 316) | 248 (147, 384) |
| Range | 2, 999 | 11, 999 | 2, 999 |
| Race | |||
| White | 5,751 (82%) | 1,147 (83%) | 4,604 (82%) |
| afro-american | 1,158 (17%) | 227 (16%) | 931 (17%) |
| other | 80 (1.1%) | 15 (1.1%) | 65 (1.2%) |
| Red.blood.cells | |||
| Mean (SD) | 4.73 (0.49) | 4.74 (0.50) | 4.73 (0.48) |
| Median (IQR) | 4.71 (4.40, 5.05) | 4.73 (4.41, 5.06) | 4.71 (4.40, 5.04) |
| Range | 2.55, 8.88 | 2.55, 6.49 | 2.58, 8.88 |
| Sedimentation.rate | |||
| Mean (SD) | 16 (11) | 19 (13) | 15 (11) |
| Median (IQR) | 13 (7, 22) | 16 (9, 27) | 12 (7, 21) |
| Range | 1, 65 | 1, 65 | 1, 61 |
| Serum.Albumin | |||
| Mean (SD) | 4.37 (0.33) | 4.29 (0.31) | 4.39 (0.33) |
| Median (IQR) | 4.40 (4.20, 4.60) | 4.30 (4.10, 4.50) | 4.40 (4.20, 4.60) |
| Range | 2.70, 5.60 | 2.70, 5.50 | 2.90, 5.60 |
| Serum.Cholesterol | |||
| Mean (SD) | 222 (50) | 239 (49) | 218 (49) |
| Median (IQR) | 218 (188, 250) | 236 (206, 268) | 213 (185, 245) |
| Range | 53, 793 | 89, 498 | 53, 793 |
| Serum.Iron | |||
| Mean (SD) | 101 (37) | 101 (34) | 101 (37) |
| Median (IQR) | 96 (75, 121) | 97 (77, 118) | 95 (75, 121) |
| Range | 17, 396 | 28, 274 | 17, 396 |
| Serum.Magnesium | |||
| Mean (SD) | 1.68 (0.14) | 1.69 (0.15) | 1.68 (0.14) |
| Median (IQR) | 1.68 (1.59, 1.77) | 1.70 (1.60, 1.79) | 1.67 (1.59, 1.76) |
| Range | 0.82, 2.70 | 1.13, 2.50 | 0.82, 2.70 |
| Serum.Protein | |||
| Mean (SD) | 7.10 (0.50) | 7.09 (0.52) | 7.10 (0.50) |
| Median (IQR) | 7.10 (6.80, 7.40) | 7.10 (6.70, 7.40) | 7.10 (6.80, 7.40) |
| Range | 4.40, 11.50 | 5.30, 11.50 | 4.40, 9.80 |
| Sex | |||
| male | 2,744 (39%) | 640 (46%) | 2,104 (38%) |
| female | 4,245 (61%) | 749 (54%) | 3,496 (62%) |
| TIBC | |||
| Mean (SD) | 363 (58) | 345 (51) | 367 (59) |
| Median (IQR) | 356 (323, 396) | 342 (311, 375) | 360 (327, 399) |
| Range | 168, 691 | 168, 565 | 175, 691 |
| TS | |||
| Mean (SD) | 28 (11) | 30 (11) | 28 (11) |
| Median (IQR) | 27 (21, 34) | 28 (23, 35) | 27 (21, 34) |
| Range | 3, 100 | 5, 96 | 3, 100 |
| White.blood.cells | |||
| Mean (SD) | 7.46 (2.29) | 7.29 (2.72) | 7.50 (2.17) |
| Median (IQR) | 7.20 (6.00, 8.60) | 7.00 (5.70, 8.30) | 7.20 (6.10, 8.60) |
| Range | 2.30, 56.00 | 2.30, 56.00 | 2.60, 51.20 |
| BMI | |||
| Mean (SD) | 25.7 (5.1) | 26.0 (4.7) | 25.6 (5.2) |
| Median (IQR) | 24.9 (22.1, 28.3) | 25.6 (22.8, 28.6) | 24.8 (22.0, 28.2) |
| Range | 12.9, 72.2 | 12.9, 47.2 | 14.0, 72.2 |
| Pulse.pressure | |||
| Mean (SD) | 51 (18) | 65 (21) | 48 (16) |
| Median (IQR) | 48 (40, 60) | 62 (50, 76) | 45 (38, 56) |
| Range | 10, 150 | 14, 150 | 10, 150 |
| time_death | |||
| Mean (SD) | 16.3 (5.5) | 12.4 (6.2) | 17.2 (4.8) |
| Median (IQR) | 18.8 (13.8, 20.0) | 13.0 (7.2, 18.5) | 19.0 (16.4, 20.1) |
| Range | 0.0, 21.8 | 0.0, 21.6 | 0.0, 21.8 |
| status | |||
| 0 | 4,488 (64%) | 321 (23%) | 4,167 (74%) |
| 1 | 2,501 (36%) | 1,068 (77%) | 1,433 (26%) |
| death | |||
| 0 | 5,888 (84%) | 879 (63%) | 5,009 (89%) |
| 1 | 1,101 (16%) | 510 (37%) | 591 (11%) |
¹ Mean (SD); Median (IQR); Range; n (%)
Which type of missingness do we have? A first approach is a graphical comparison between the observed and missing groups in each variable, using boxplots and overlapping histograms for the numeric variables. If the values were Missing Completely at Random (MCAR) we would expect the distributions to be identical, or at least comparable: under MCAR the missingness does not relate to any other data in the dataset, and there is no pattern in the actual values that are missing. If the distributions differ, two cases remain:
- MAR: the missingness is related to other variables in the dataset, but the actual values that are missing are random.
- MNAR: the missingness is related to other variables, and in addition the missing values themselves are not random: the value of the variable that is missing is related to the reason it is missing.
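One numerical complement to the plots is to model the missingness indicator itself: if other covariates predict it, MCAR is implausible (though no statistical test can separate MAR from MNAR, since that depends on the unobserved values). A hedged sketch, using covariate names that appear later in this section:

```r
# Sketch: regress a 0/1 missingness indicator for one column on other covariates.
# Clearly non-zero coefficients argue against MCAR, but cannot rule out MNAR.
missingness_check <- function(df, target, covars) {
  df$.miss <- as.integer(is.na(df[[target]]))
  f <- reformulate(covars, response = ".miss")
  glm(f, data = df, family = binomial)
}
# e.g. summary(missingness_check(survey, "Systolic.BP",
#        c("Age", "Pulse.pressure", "Sedimentation.rate", "Poverty.index")))
```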
#numeric variables
for (var in numeric_vars){
hist.p<-gghistogram(survey, x = var, y ="..density.." ,add = "median", bins = 20, rug = TRUE, fill = "systolic_test",add_density = TRUE, alpha = 0.5,palette = c("Orange", "#bb44f0")) +theme_minimal()
bx.p <- ggboxplot(survey, x="systolic_test", y=var, ylab = var, color ="systolic_test",palette = c("Orange", "#bb44f0"))+theme_minimal()
tot.p <- grid.arrange(bx.p,hist.p, ncol = 2)
}
#categorical variable
barMR <- ggplot(data = survey, aes( x =Race, fill = systolic_test)) +
geom_bar(position = "fill") +
xlab("Race") +ylab("Relative Frequency")+
ggtitle("Missings per Race") +
scale_fill_manual(values=c("Orange", "#bb44f0"))+theme_classic()
barMS <- ggplot(data = survey, aes( x = Sex, fill = systolic_test)) +
geom_bar(position = "fill") +
xlab("Sex") +ylab("Relative Frequency")+
ggtitle("Missings per Sex") +
scale_fill_manual(values=c("Orange", "#bb44f0"))+theme_classic()
barMD <- ggplot(data = survey, aes( x = death, fill = systolic_test)) +
geom_bar(position = "fill") +
xlab("Death") +ylab("Relative Frequency")+
ggtitle("Missings per death") +
scale_fill_manual(values=c("Orange", "#bb44f0"))+theme_classic()
barMSt <- ggplot(data = survey, aes( x = status, fill = systolic_test)) +
geom_bar(position = "fill") +
xlab("status") +ylab("Relative Frequency")+
ggtitle("Missings per status") +
scale_fill_manual(values=c("Orange", "#bb44f0"))+theme_classic()
tot.p <- grid.arrange(barMR,barMS, barMSt, barMD, ncol = 2, nrow = 2)
From these plots we can see that the missingness in Systolic.BP is not completely at random; rather, it is related to other variables. The missing and observed groups differ drastically by Age, and missingness is also slightly associated with higher values of Pulse.pressure and Sedimentation.rate and lower values of Poverty.index. Moreover, missingness is more frequent among patients who died within 10 years or experienced the event before the end of the study.
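For the categorical variables, the analogous numerical check is a chi-squared test of independence between each variable and the missingness indicator; a sketch:

```r
# Chi-squared p-values for independence between a missingness indicator
# and each categorical variable
chisq_by_missing <- function(df, cat_vars, miss_var = "systolic_test") {
  sapply(cat_vars, function(v)
    chisq.test(table(df[[v]], df[[miss_var]]))$p.value)
}
# e.g. chisq_by_missing(survey, c("death", "status", "Sex", "Race"))
```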
We can assess whether the difference between the observed and missing distributions is statistically significant by performing a Kruskal-Wallis test on the continuous variables. In these tests:
- \(H_0\): indicates equivalence between the two distributions (two sample are from identical population)
- \(H_a\): the difference between distributions is statistically significant.
#kruskal-wallis
cont<- c("Age", "Pulse.pressure", "Sedimentation.rate", "Poverty.index")
categ<-c("death", "status", "Sex")
kruskal.test(Age ~ systolic_test, data = survey)
##
## Kruskal-Wallis rank sum test
##
## data: Age by systolic_test
## Kruskal-Wallis chi-squared = 2527, df = 1, p-value < 2.2e-16
kruskal.test(Pulse.pressure ~ systolic_test, data = survey)
##
## Kruskal-Wallis rank sum test
##
## data: Pulse.pressure by systolic_test
## Kruskal-Wallis chi-squared = 845.26, df = 1, p-value < 2.2e-16
kruskal.test(Sedimentation.rate ~ systolic_test, data = survey)
##
## Kruskal-Wallis rank sum test
##
## data: Sedimentation.rate by systolic_test
## Kruskal-Wallis chi-squared = 117.1, df = 1, p-value < 2.2e-16
kruskal.test(Poverty.index ~ systolic_test, data = survey)
##
## Kruskal-Wallis rank sum test
##
## data: Poverty.index by systolic_test
## Kruskal-Wallis chi-squared = 79.861, df = 1, p-value < 2.2e-16
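The four calls above can also be written as a single loop, and since several hypotheses are tested it is prudent to report adjusted p-values as well; a sketch using the Benjamini-Hochberg correction:

```r
# Kruskal-Wallis p-values for several continuous variables, BH-adjusted
kw_pvals <- function(df, vars, group) {
  p <- sapply(vars, function(v) kruskal.test(df[[v]] ~ df[[group]])$p.value)
  p.adjust(p, method = "BH")
}
# e.g. kw_pvals(survey, cont, "systolic_test")
```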
Each test gives strong evidence to reject the null hypothesis (p-value << 0.05). We can therefore conclude that the observed and missing distributions differ, so the missingness is definitely not MCAR, and we must choose between MAR and MNAR. MNAR would imply a physiological mechanism making the Systolic.BP measurement itself depend on age, pressure, or sedimentation, whereas these tests only show an association. Deciding this would require clinical knowledge we do not have: we cannot consult a specialist, and we do not know how the study was conducted or how the population was selected. I therefore choose to treat the missing values as Missing at Random. We can only speculate about why the missingness relates to those variables; perhaps clinicians did not want to subject severely ill, elderly patients to further diagnostic tests. In any case we cannot ignore the missing values and must find a way to handle them. Furthermore, patients without a blood-pressure measurement apparently have higher mortality (the percentage of missing values is higher among patients who died within 10 years), suggesting that discarding the incomplete records would overestimate the true survival of the cohort.
Since for each patient we have two other measurements related to blood pressure, Pulse.pressure and Diastolic.BP, imputing the missing values from that information seems reasonable. But let’s first verify that these variables are actually correlated with Systolic.BP, by plotting a heatmap based on the Pearson correlation.
# variables excluded from the correlation matrix (time_death is numeric but not a predictor)
categoric_vars<-c('Race','Sex','status','death', 'time_death', 'systolic_test')
numeric_vars<- survey_trans %>% dplyr::select(-(all_of(categoric_vars))) %>% colnames()
cormat <- round(x = cor(dplyr::select(survey_trans, all_of(numeric_vars)), use = "complete.obs"), digits = 2)
# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
# Heatmap
reorder_cormat <- function(cormat){
# Use correlation between variables as distance
dd <- as.dist((1-cormat)/2)
hc <- hclust(dd)
cormat <-cormat[hc$order, hc$order]
}
# Reorder the correlation matrix
cormat <- reorder_cormat(cormat)
upper_tri <- get_upper_tri(cormat)
# Melt the correlation matrix
melted_cormat <- reshape2::melt(upper_tri, na.rm = TRUE)
# Create a ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+ # minimal theme
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
# Print the heatmap
ggheatmap+ geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
As we guessed, the heatmap shows the high correlation of Systolic.BP with Diastolic.BP and Pulse.pressure. This suggests that an imputation technique such as MICE, which internally uses regression models and requires the MAR assumption, should lead to reasonably accurate results.
MICE: Data imputation
MICE (Multiple Imputation by Chained Equations). A brief overview of the algorithm: it cycles through the following steps, re-imputing the missing data on each pass until convergence:
1. Initialize every missing value with a simple placeholder (e.g. the mean).
2. Take the variable with the fewest missing values and build a regression model for it, using the other variables as predictors.
3. Replace its missing values with the model's predictions.
4. Move on to the next variable with missing values and repeat.
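The core of one such cycle can be sketched on a toy two-variable example in base R (a conceptual illustration only, not what mice() does internally in full):

```r
set.seed(1)
# Toy data: y depends linearly on x; a fifth of the y values are missing
x <- rnorm(100)
y <- 2 * x + rnorm(100)
y[sample(100, 20)] <- NA
miss <- is.na(y)

# Step 1: initialize the missing entries with the observed mean
y_imp <- ifelse(miss, mean(y, na.rm = TRUE), y)

# Steps 2-3: regress on the complete variable using only the observed rows,
# then replace the initial placeholders with model predictions
fit <- lm(y_imp ~ x, subset = !miss)
y_imp[miss] <- predict(fit, newdata = data.frame(x = x[miss]))

# Step 4 would cycle over any remaining incomplete variables until convergence
```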
#used as predictors only the variables that have correlation > 0.25 with Systolic.BP
pred_mat <- quickpred(survey_trans, mincor = 0.25)
imp <- mice(survey_trans, m=10, seed = 1, predictorMatrix = pred_mat) #m = number of datasets generated
#diagnostic plots
densityplot(imp)
bwplot(imp)
The densityplot() shows that the imputed values of Systolic.BP are higher than the observed ones. According to bwplot() all the imputed datasets are equivalent, so I choose the first.
survey_mice<-mice::complete(imp,1)
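Note that selecting a single completed dataset ignores the between-imputation variability; for final inference the usual recommendation is to fit the model in all m datasets and pool the estimates with Rubin's rules. A sketch using the mice interface (the model formula here is purely illustrative):

```r
# Fit the same logistic model in each of the m = 10 imputed datasets, then pool
fit_mi <- with(imp, glm(death ~ Age + Systolic.BP, family = binomial))
summary(mice::pool(fit_mi))
```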
# See the difference using the mean
survey_mean<- survey_trans
survey_mean$Systolic.BP[is.na(survey_mean$Systolic.BP)] <- mean(survey_mean$Systolic.BP, na.rm = TRUE)
par(mfrow=c(1,2))
boxplot(survey_trans$Systolic.BP, survey_mice$Systolic.BP,
names = c("observed", "imputed"), col = c("orange","#bb44f0"), main = "MICE imputation")
boxplot(survey$Systolic.BP, survey_mean$Systolic.BP,
names = c("observed", "imputed"), col = c("orange","#bb44f0"), main = "Mean Imputation")
Mean imputation seems to reflect the observed distribution better, but it does not take the relationships with the other variables into account. Moreover, we saw that the missing Systolic.BP values are associated with higher Pulse.pressure and Diastolic.BP; since these variables are strongly correlated with Systolic.BP, the imputed values can plausibly be higher than the observed ones.
Question 3: Univariable Regression
Perform univariable regression analyses, of all candidate predictors for your model.
Which regression?
- The focus is on predicting the 10-year mortality risk, not the time to event, so the outcome variable is death, a binary variable.
- We do not have to worry about censoring, since the first censored patient occurs after 10.03 years.
cens<- ggplot(survey_mice, aes(x = time_death, fill = status)) +
geom_histogram(binwidth = 0.3, position = position_stack(reverse = TRUE))+xlab("Time to death(years)")+
scale_fill_manual(values=c("Orange", "LightGrey"))+ geom_vline(xintercept = 10, linetype="dashed")+
ggtitle("Censored patients")+
theme_classic()
cens
Thus I decided to use logistic regression.
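Each row of the table produced below is a separate single-predictor logistic model; for any one predictor the odds ratio and its Wald confidence interval can also be extracted by hand. A small sketch (the Age model is just an example):

```r
# Exponentiate a logistic model's coefficients and Wald CIs to the odds-ratio scale
or_table <- function(fit) {
  exp(cbind(OR = coef(fit), confint.default(fit)))
}
# e.g. or_table(glm(death ~ Age, data = survey_mice, family = binomial))
```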
# note: these odds ratios express the effect of a 1-unit increase in each predictor
# (on its transformed scale, where applicable), whereas the rms summary() output
# reports interquartile-range effects
survey_mice %>% dplyr::select(-c(status, systolic_test)) %>%
tbl_uvregression(method = glm, # glm function
method.args = list(family = binomial),# logistic model
exponentiate = T, # report OR
y=death,# outcome variable
conf.level = 0.95,
pvalue_fun = ~style_pvalue(.x, digits = 3)) %>%
bold_labels() %>%
bold_p()
| Characteristic | N | OR¹ | 95% CI¹ | p-value |
|---|---|---|---|---|
| Age | 6989 | 1.10 | 1.09, 1.11 | <0.001 |
| Diastolic.BP | 6989 | 1.03 | 1.02, 1.03 | <0.001 |
| Poverty.index.sqrt | 6989 | 0.95 | 0.94, 0.96 | <0.001 |
| Race | 6989 | |||
| White | — | — | ||
| afro-american | 1.45 | 1.23, 1.70 | <0.001 | |
| other | 1.44 | 0.80, 2.43 | 0.198 | |
| Red.blood.cells.log | 6989 | 1.72 | 0.92, 3.22 | 0.088 |
| Sedimentation.rate.log | 6989 | 1.52 | 1.40, 1.66 | <0.001 |
| Serum.Albumin | 6989 | 0.31 | 0.26, 0.38 | <0.001 |
| Serum.Cholesterol.log | 6989 | 2.76 | 2.06, 3.72 | <0.001 |
| Serum.Iron.sqrt | 6989 | 0.95 | 0.92, 0.98 | 0.005 |
| Serum.Magnesium | 6989 | 0.62 | 0.39, 0.97 | 0.036 |
| Serum.Protein | 6989 | 1.17 | 1.03, 1.32 | 0.018 |
| Sex | 6989 | |||
| male | — | — | ||
| female | 0.38 | 0.33, 0.43 | <0.001 | |
| Systolic.BP | 6989 | 1.03 | 1.02, 1.03 | <0.001 |
| TIBC | 6989 | 1.00 | 0.99, 1.00 | <0.001 |
| TS | 6989 | 1.00 | 1.00, 1.01 | 0.385 |
| White.blood.cells.log | 6989 | 1.33 | 1.05, 1.68 | 0.019 |
| BMI.log | 6989 | 1.19 | 0.84, 1.67 | 0.319 |
| Pulse.pressure.sqrt | 6989 | 1.73 | 1.64, 1.83 | <0.001 |
| time_death | 6989 | 0.00 | 0.00, Inf | 0.668 |
¹ OR = Odds Ratio, CI = Confidence Interval
Considering this univariable regression, and assuming linearity of the effect of each predictor, the crude odds ratios suggest:
- no effect of Poverty.index, Serum.Cholesterol, Serum.Iron, TIBC, TS, or BMI on the outcome
- a protective effect of Serum.Albumin, Serum.Magnesium, and of being female rather than male
- a greater risk with increasing Age, Diastolic.BP, Red.blood.cells, Sedimentation.rate, Serum.Protein, Systolic.BP, White.blood.cells, and Pulse.pressure
Let’s now explore whether linearity is a valid assumption for the effect of each predictor:
# Visual exploration of the linearity of the effect for continuous predictor
par(mfrow = c(3,3))
plsmo(survey_mice$Age,survey_mice$death)
plsmo(survey_mice$Diastolic.BP,survey_mice$death) #little strange
plsmo((survey_mice$Poverty.index.sqrt),survey_mice$death) #strange
plsmo(survey_mice$Red.blood.cells.log,survey_mice$death)# U shape
plsmo((survey_mice$Sedimentation.rate.log),survey_mice$death)
plsmo(survey_mice$Serum.Albumin,survey_mice$death)
plsmo(survey_mice$Serum.Cholesterol.log,survey_mice$death)
plsmo((survey_mice$Serum.Iron.sqrt),survey_mice$death)# little strange
plsmo(survey_mice$Serum.Magnesium,survey_mice$death) #little strange
par(mfrow = c(3,3))
plsmo(survey_mice$Serum.Protein,survey_mice$death) #strange
plsmo((survey_mice$Systolic.BP),survey_mice$death)
plsmo(survey_mice$TIBC,survey_mice$death)
plsmo(survey_mice$TS,survey_mice$death)
plsmo((survey_mice$White.blood.cells.log),survey_mice$death)# strange
plsmo(survey_mice$BMI.log,survey_mice$death)# strange
plsmo(survey_mice$Pulse.pressure.sqrt,survey_mice$death)
plsmo(survey_mice$time_death,survey_mice$death)
plsmo(survey_mice$Age,survey_mice$death)
Let’s investigate whether the odd-looking plots correspond to statistically significant non-linear effects. To check this, I used restricted cubic splines to model the non-linear components and then tested whether these components are statistically significant, both by looking at the p-values and by using the ANOVA, where \(H_0\) assumes a linear effect between the log odds of the outcome and the predictor.
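Before applying this to the survey, here is a self-contained sketch of the idea on simulated data (with base `splines::ns` standing in for `rms::rcs`): fit a linear and a spline logistic model, then compare them with a likelihood-ratio test of \(H_0\): the effect is linear.

```r
# Sketch: testing linearity with a likelihood-ratio test on simulated data.
# ns() from the splines package stands in for rcs(); H0 is a linear effect.
library(splines)
set.seed(2)
n <- 2000
x <- runif(n, -2, 2)
y <- rbinom(n, 1, plogis(-1 + x^2))   # deliberately U-shaped true effect

fit.lin <- glm(y ~ x,             family = binomial)
fit.spl <- glm(y ~ ns(x, df = 3), family = binomial)
anova(fit.lin, fit.spl, test = "LRT")   # small p-value: reject linearity
```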
# non linearity between predictors and outcome
fit.splines.BMI <- lrm(death ~ rcs(BMI.log,4), data=survey_mice)
print(fit.splines.BMI)## Logistic Regression Model
##
## lrm(formula = death ~ rcs(BMI.log, 4), data = survey_mice)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 6989 LR chi2 20.20 R2 0.005 C 0.530
## 0 5888 d.f. 3 g 0.160 Dxy 0.059
## 1 1101 Pr(> chi2) 0.0002 gr 1.174 gamma 0.060
## max |deriv| 2e-12 gp 0.021 tau-a 0.016
## Brier 0.132
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 6.3983 2.0264 3.16 0.0016
## BMI.log -2.6999 0.6704 -4.03 <0.0001
## BMI.log' 10.2807 2.3560 4.36 <0.0001
## BMI.log'' -32.2254 7.7819 -4.14 <0.0001
##
summary(fit.splines.BMI)## Effects Response : death
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## BMI.log 3.0974 3.3433 0.24586 0.31535 0.08657 0.14568 0.48503
## Odds Ratio 3.0974 3.3433 0.24586 1.37070 NA 1.15680 1.62420
anova(fit.splines.BMI)## Wald Statistics Response: death
##
## Factor Chi-Square d.f. P
## BMI.log 20.51 3 1e-04
## Nonlinear 19.80 2 1e-04
## TOTAL 20.51 3 1e-04
# Non-linearity between each predictor and the outcome
fit.splines.TS <- lrm(death ~ rcs(TS,4), data=survey_mice)
#print(fit.splines.TS)
#summary(fit.splines.TS)
#anova(fit.splines.TS)
fit.splines.Protein <- lrm(death ~ rcs(Serum.Protein,3), data=survey_mice)
#print(fit.splines.Protein)
#summary(fit.splines.Protein)
#anova(fit.splines.Protein)
fit.splines.Mag <- lrm(death ~ rcs(Serum.Magnesium,3), data=survey_mice)
#print(fit.splines.Mag)
#summary(fit.splines.Mag)
#anova(fit.splines.Mag)
fit.splines.Iron <- lrm(death ~ rcs(Serum.Iron.sqrt,6), data=survey_mice)
#print(fit.splines.Iron)
#summary(fit.splines.Iron)
#anova(fit.splines.Iron)
fit.splines.Chol <- lrm(death ~ rcs(Serum.Cholesterol.log,4), data=survey_mice)
#print(fit.splines.Chol)
#summary(fit.splines.Chol)
fit.splines.Red <- lrm(death ~ rcs(Red.blood.cells.log,3), data=survey_mice)
#print(fit.splines.Red)
#summary(fit.splines.Red)
#anova(fit.splines.Red)
fit.splines.Poverty <- lrm(death ~ rcs(Poverty.index.sqrt,4), data=survey_mice)
#print(fit.splines.Poverty)
#summary(fit.splines.Poverty)
#anova(fit.splines.Poverty)
fit.splines.Diastolic <- lrm(death ~ rcs(Diastolic.BP,4), data=survey_mice)
#print(fit.splines.Diastolic)
#summary(fit.splines.Diastolic)
#anova(fit.splines.Diastolic)
fit.splines.Sys <- lrm(death ~ rcs(Systolic.BP,4), data=survey_mice)
#print(fit.splines.Sys)
#summary(fit.splines.Sys)
#anova(fit.splines.Sys)
fit.splines.Pres <- lrm(death ~ rcs(Pulse.pressure.sqrt,4), data=survey_mice)
#print(fit.splines.Pres)
#summary(fit.splines.Pres)
#anova(fit.splines.Pres)
These tests show a statistically significant non-linear effect for: Poverty.index, Systolic.BP, Pulse.pressure, Diastolic.BP, Red.blood.cells, Serum.Iron, BMI, Serum.Magnesium.
# estimated effect of covariate changes on the risk of death (on the log-odds scale)
a<-ggplot(Predict(fit.splines.Poverty), colfill = "Orange")+theme_classic()
b<-ggplot(Predict(fit.splines.Sys),colfill = "Orange")+theme_classic()
c<-ggplot(Predict(fit.splines.Pres),colfill = "Orange")+theme_classic()
d<-ggplot(Predict(fit.splines.Diastolic),colfill = "Orange")+theme_classic()
e<-ggplot(Predict(fit.splines.Red),colfill = "Orange")+theme_classic()
f<-ggplot(Predict(fit.splines.Iron),colfill = "Orange")+theme_classic()
g<-ggplot(Predict(fit.splines.BMI),colfill = "Orange")+theme_classic()
h<-ggplot(Predict(fit.splines.Mag),colfill = "Orange")+theme_classic()
i<-ggplot(Predict(fit.splines.Protein), colfill = "Orange")+theme_classic()
tot.p <- grid.arrange(a,b,c,d,e,f,g,h, i, ncol = 3, nrow = 3)
Question 4: Multivariable Regression
Build a multivariable model. Various selection procedures of the candidate predictors could be performed (try different approaches and compare them… we don’t have here the expert’s opinion, therefore the final choice will be based only on statistical evaluations). Comment on the effect estimated for the predictors retained in the final multivariable model
I fit several models, using both automated algorithms (backward, stepwise, hybrid) and manual selection of predictors, judged on AIC, statistical significance, and likelihood-ratio tests. I will consider here only the 4 best models.
Backward elimination
Backward by hand
# Full model considering the non-linear effects
fit.multi.lrm<-lrm(death ~ Age+rcs(Diastolic.BP,4)+rcs(Poverty.index.sqrt,4)+Race+ rcs(Red.blood.cells.log,3)+
Sedimentation.rate.log+Serum.Albumin+Serum.Cholesterol.log+ rcs(Serum.Iron.sqrt,6)+rcs(Serum.Magnesium,3)+rcs(Serum.Protein,3)+
Sex+TIBC+TS+White.blood.cells.log+rcs(BMI.log,4)+rcs(Pulse.pressure.sqrt,4) ,data = survey_mice, y=T, x=T)
print(fit.multi.lrm) # adjusted odds ratios
## Logistic Regression Model
##
## lrm(formula = death ~ Age + rcs(Diastolic.BP, 4) + rcs(Poverty.index.sqrt,
## 4) + Race + rcs(Red.blood.cells.log, 3) + Sedimentation.rate.log +
## Serum.Albumin + Serum.Cholesterol.log + rcs(Serum.Iron.sqrt,
## 6) + rcs(Serum.Magnesium, 3) + rcs(Serum.Protein, 3) + Sex +
## TIBC + TS + White.blood.cells.log + rcs(BMI.log, 4) + rcs(Pulse.pressure.sqrt,
## 4), data = survey_mice, x = T, y = T)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 6989 LR chi2 1716.34 R2 0.374 C 0.858
## 0 5888 d.f. 33 g 1.986 Dxy 0.716
## 1 1101 Pr(> chi2) <0.0001 gr 7.287 gamma 0.716
## max |deriv| 6e-08 gp 0.190 tau-a 0.190
## Brier 0.098
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 9.9872 3.2930 3.03 0.0024
## Age 0.0894 0.0042 21.48 <0.0001
## Diastolic.BP -0.0208 0.0130 -1.60 0.1096
## Diastolic.BP' 0.0651 0.0434 1.50 0.1341
## Diastolic.BP'' -0.1457 0.1360 -1.07 0.2840
## Poverty.index.sqrt -0.0208 0.0291 -0.71 0.4749
## Poverty.index.sqrt' -0.1338 0.1376 -0.97 0.3307
## Poverty.index.sqrt'' 0.4932 0.4070 1.21 0.2255
## Race=afro-american 0.0011 0.1157 0.01 0.9924
## Race=other 0.4355 0.3476 1.25 0.2102
## Red.blood.cells.log -2.6123 0.7551 -3.46 0.0005
## Red.blood.cells.log' 3.5823 0.8772 4.08 <0.0001
## Sedimentation.rate.log 0.2099 0.0559 3.75 0.0002
## Serum.Albumin -0.5375 0.1477 -3.64 0.0003
## Serum.Cholesterol.log -0.1042 0.1919 -0.54 0.5871
## Serum.Iron.sqrt 0.1311 0.1379 0.95 0.3416
## Serum.Iron.sqrt' -3.4756 1.0553 -3.29 0.0010
## Serum.Iron.sqrt'' 23.5598 7.0367 3.35 0.0008
## Serum.Iron.sqrt''' -44.0801 14.5286 -3.03 0.0024
## Serum.Iron.sqrt'''' 33.5164 13.3426 2.51 0.0120
## Serum.Magnesium -1.5536 0.5097 -3.05 0.0023
## Serum.Magnesium' 0.9292 0.5702 1.63 0.1031
## Serum.Protein -0.0301 0.1807 -0.17 0.8677
## Serum.Protein' 0.2812 0.1823 1.54 0.1229
## Sex=female -1.0573 0.0898 -11.77 <0.0001
## TIBC 0.0028 0.0016 1.74 0.0814
## TS 0.0172 0.0187 0.92 0.3572
## White.blood.cells.log 0.7833 0.1502 5.22 <0.0001
## BMI.log -2.7273 0.8112 -3.36 0.0008
## BMI.log' 4.6291 2.7523 1.68 0.0926
## BMI.log'' -9.9192 9.0290 -1.10 0.2719
## Pulse.pressure.sqrt -0.1290 0.1687 -0.76 0.4446
## Pulse.pressure.sqrt' 0.3395 0.5979 0.57 0.5701
## Pulse.pressure.sqrt'' -0.4635 1.7710 -0.26 0.7935
##
s <- summary(fit.multi.lrm)
fit.multi.glm<-glm(death ~ Age+ns(Diastolic.BP,4)+ns(Poverty.index.sqrt,4)+Race+ ns(Red.blood.cells.log,3)+
Sedimentation.rate.log+Serum.Albumin+Serum.Cholesterol.log+ ns(Serum.Iron.sqrt,6)+ns(Serum.Magnesium,3)+ns(Serum.Protein,3)+
Sex+ns(Systolic.BP,4)+TIBC+TS+White.blood.cells.log+ns(BMI.log,4)+ns(Pulse.pressure.sqrt,3) ,family = binomial, data = survey_mice)
We should consider the correlation between variables. According to the heatmap made earlier, some variables are highly correlated, such as Systolic.BP, Diastolic.BP and Pulse.pressure. In this case we may have an issue of collinearity to take into account. We can test it using the VIF (Variance Inflation Factor). The smallest possible value of the VIF is 1, which indicates a complete absence of multicollinearity. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
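The definition of the VIF can be made explicit with a minimal sketch on simulated collinear predictors (not the survey data): \(VIF_j = 1/(1 - R^2_j)\), where \(R^2_j\) comes from regressing predictor \(j\) on the remaining predictors.

```r
# Sketch: VIF from first principles on simulated collinear predictors.
# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest.
set.seed(3)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # almost a copy of x1: strong collinearity
x3 <- rnorm(n)                  # independent predictor

vif_by_hand <- function(j, X) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
}
X <- cbind(x1, x2, x3)
round(sapply(1:3, vif_by_hand, X = X), 1)  # x1, x2 far above 10; x3 near 1
```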
vif_values <- vif(fit.multi.lrm)
vif_values## Age Diastolic.BP Diastolic.BP'
## 1.428483 22.837245 260.245029
## Diastolic.BP'' Poverty.index.sqrt Poverty.index.sqrt'
## 153.584416 22.309495 301.220465
## Poverty.index.sqrt'' Race=afro-american Race=other
## 188.119937 1.393221 1.022300
## Red.blood.cells.log Red.blood.cells.log' Sedimentation.rate.log
## 4.821616 4.520868 1.524514
## Serum.Albumin Serum.Cholesterol.log Serum.Iron.sqrt
## 1.546356 1.154070 39.247834
## Serum.Iron.sqrt' Serum.Iron.sqrt'' Serum.Iron.sqrt'''
## 2548.466712 17059.507125 17024.966107
## Serum.Iron.sqrt'''' Serum.Magnesium Serum.Magnesium'
## 2055.354192 4.227201 4.067548
## Serum.Protein Serum.Protein' Sex=female
## 6.032579 5.254931 1.369111
## TIBC TS White.blood.cells.log
## 5.199133 29.506607 1.125166
## BMI.log BMI.log' BMI.log''
## 16.102262 165.680213 98.447754
## Pulse.pressure.sqrt Pulse.pressure.sqrt' Pulse.pressure.sqrt''
## 32.935570 428.295194 254.721086
fit.multi.glm %>%
tbl_regression(exponentiate = TRUE) %>%
bold_labels() %>%
bold_p %>%
add_vif() # adjusted GVIF should be < 2
| Characteristic | OR1 | 95% CI1 | p-value | GVIF1 | Adjusted GVIF2,1 |
|---|---|---|---|---|---|
| Age | 1.09 | 1.08, 1.10 | <0.001 | 1.5 | 1.2 |
| ns(Diastolic.BP, 4) | 208 | 1.9 | |||
| ns(Diastolic.BP, 4)1 | 0.04 | 0.00, 1.12 | 0.057 | ||
| ns(Diastolic.BP, 4)2 | 0.03 | 0.00, 1.76 | 0.092 | ||
| ns(Diastolic.BP, 4)3 | 0.00 | 0.00, 9.75 | 0.15 | ||
| ns(Diastolic.BP, 4)4 | 0.03 | 0.00, 31.7 | 0.3 | ||
| ns(Poverty.index.sqrt, 4) | 1.2 | 1.0 | |||
| ns(Poverty.index.sqrt, 4)1 | 0.86 | 0.40, 1.85 | 0.7 | ||
| ns(Poverty.index.sqrt, 4)2 | 0.62 | 0.34, 1.14 | 0.12 | ||
| ns(Poverty.index.sqrt, 4)3 | 1.09 | 0.18, 6.96 | >0.9 | ||
| ns(Poverty.index.sqrt, 4)4 | 0.62 | 0.40, 0.96 | 0.034 | ||
| Race | 1.4 | 1.1 | |||
| White | — | — | |||
| afro-american | 1.00 | 0.80, 1.26 | >0.9 | ||
| other | 1.59 | 0.79, 3.09 | 0.2 | ||
| ns(Red.blood.cells.log, 3) | 1.7 | 1.1 | |||
| ns(Red.blood.cells.log, 3)1 | 0.29 | 0.12, 0.67 | 0.004 | ||
| ns(Red.blood.cells.log, 3)2 | 0.19 | 0.00, 9.36 | 0.4 | ||
| ns(Red.blood.cells.log, 3)3 | 11.3 | 0.88, 165 | 0.065 | ||
| Sedimentation.rate.log | 1.22 | 1.09, 1.37 | <0.001 | 1.5 | 1.2 |
| Serum.Albumin | 0.57 | 0.43, 0.76 | <0.001 | 1.6 | 1.3 |
| Serum.Cholesterol.log | 0.92 | 0.63, 1.34 | 0.6 | 1.2 | 1.1 |
| ns(Serum.Iron.sqrt, 6) | 31 | 1.3 | |||
| ns(Serum.Iron.sqrt, 6)1 | 1.50 | 0.39, 6.36 | 0.6 | ||
| ns(Serum.Iron.sqrt, 6)2 | 1.09 | 0.23, 5.72 | >0.9 | ||
| ns(Serum.Iron.sqrt, 6)3 | 1.59 | 0.33, 8.39 | 0.6 | ||
| ns(Serum.Iron.sqrt, 6)4 | 0.25 | 0.04, 1.66 | 0.15 | ||
| ns(Serum.Iron.sqrt, 6)5 | 4.84 | 0.08, 344 | 0.5 | ||
| ns(Serum.Iron.sqrt, 6)6 | 2.45 | 0.03, 197 | 0.7 | ||
| ns(Serum.Magnesium, 3) | 1.1 | 1.0 | |||
| ns(Serum.Magnesium, 3)1 | 0.39 | 0.17, 0.93 | 0.029 | ||
| ns(Serum.Magnesium, 3)2 | 1.60 | 0.05, 63.5 | 0.8 | ||
| ns(Serum.Magnesium, 3)3 | 2.25 | 0.36, 13.1 | 0.4 | ||
| ns(Serum.Protein, 3) | 1.7 | 1.1 | |||
| ns(Serum.Protein, 3)1 | 1.23 | 0.47, 3.25 | 0.7 | ||
| ns(Serum.Protein, 3)2 | 0.14 | 0.00, 11.1 | 0.4 | ||
| ns(Serum.Protein, 3)3 | 0.86 | 0.04, 29.1 | >0.9 | ||
| Sex | 1.4 | 1.2 | |||
| male | — | — | |||
| female | 0.34 | 0.29, 0.41 | <0.001 | ||
| ns(Systolic.BP, 4) | 1,432 | 2.5 | |||
| ns(Systolic.BP, 4)1 | 14.0 | 0.60, 334 | 0.10 | ||
| ns(Systolic.BP, 4)2 | 120 | 0.88, 16,835 | 0.057 | ||
| ns(Systolic.BP, 4)3 | 1,347 | 0.14, 13,912,837 | 0.12 | ||
| ns(Systolic.BP, 4)4 | 358 | 0.09, 1,565,476 | 0.2 | ||
| TIBC | 1.00 | 1.00, 1.01 | 0.084 | 5.4 | 2.3 |
| TS | 1.02 | 0.98, 1.05 | 0.4 | 30 | 5.5 |
| White.blood.cells.log | 2.18 | 1.62, 2.93 | <0.001 | 1.1 | 1.1 |
| ns(BMI.log, 4) | 1.4 | 1.0 | |||
| ns(BMI.log, 4)1 | 0.16 | 0.04, 0.56 | 0.005 | ||
| ns(BMI.log, 4)2 | 0.37 | 0.15, 0.91 | 0.032 | ||
| ns(BMI.log, 4)3 | 0.05 | 0.00, 0.87 | 0.042 | ||
| ns(BMI.log, 4)4 | 0.31 | 0.04, 2.00 | 0.2 | ||
| ns(Pulse.pressure.sqrt, 3) | 507 | 2.8 | |||
| ns(Pulse.pressure.sqrt, 3)1 | 0.03 | 0.00, 0.87 | 0.041 | ||
| ns(Pulse.pressure.sqrt, 3)2 | 0.00 | 0.00, 3.09 | 0.10 | ||
| ns(Pulse.pressure.sqrt, 3)3 | 0.02 | 0.00, 8.14 | 0.2 | ||
1 OR = Odds Ratio, CI = Confidence Interval, GVIF = Generalized Variance Inflation Factor
2 GVIF^[1/(2*df)]
As expected, there is a collinearity problem among the three blood-pressure measurements. Let's start by dropping TS, which has the greatest adjusted GVIF, and then compare the models with the ANOVA, where \(H_0\) states that the simpler model is adequate.
#removed predictor with highest VIF: TS
fit.multi.glm_ts<-glm(death ~ Age+ns(Diastolic.BP,4)+ns(Poverty.index.sqrt,3)+Race+ ns(Red.blood.cells.log,4)+
Sedimentation.rate.log+Serum.Albumin+Serum.Cholesterol.log+ ns(Serum.Iron.sqrt,6)+ns(Serum.Magnesium,3)+Serum.Protein+
Sex+ns(Systolic.BP,4)+TIBC+White.blood.cells.log+ns(BMI.log,4)+ns(Pulse.pressure.sqrt,3) ,family = binomial, data = survey_mice)
fit.multi.glm_ts##
## Call: glm(formula = death ~ Age + ns(Diastolic.BP, 4) + ns(Poverty.index.sqrt,
## 3) + Race + ns(Red.blood.cells.log, 4) + Sedimentation.rate.log +
## Serum.Albumin + Serum.Cholesterol.log + ns(Serum.Iron.sqrt,
## 6) + ns(Serum.Magnesium, 3) + Serum.Protein + Sex + ns(Systolic.BP,
## 4) + TIBC + White.blood.cells.log + ns(BMI.log, 4) + ns(Pulse.pressure.sqrt,
## 3), family = binomial, data = survey_mice)
##
## Coefficients:
## (Intercept) Age
## -0.981747 0.089018
## ns(Diastolic.BP, 4)1 ns(Diastolic.BP, 4)2
## -3.356232 -3.618972
## ns(Diastolic.BP, 4)3 ns(Diastolic.BP, 4)4
## -6.597642 -3.558118
## ns(Poverty.index.sqrt, 3)1 ns(Poverty.index.sqrt, 3)2
## -0.708969 -0.529594
## ns(Poverty.index.sqrt, 3)3 Raceafro-american
## -0.499937 0.008943
## Raceother ns(Red.blood.cells.log, 4)1
## 0.461441 -2.044900
## ns(Red.blood.cells.log, 4)2 ns(Red.blood.cells.log, 4)3
## -1.194556 -1.825077
## ns(Red.blood.cells.log, 4)4 Sedimentation.rate.log
## 2.275725 0.202774
## Serum.Albumin Serum.Cholesterol.log
## -0.574562 -0.096457
## ns(Serum.Iron.sqrt, 6)1 ns(Serum.Iron.sqrt, 6)2
## 0.612242 0.365352
## ns(Serum.Iron.sqrt, 6)3 ns(Serum.Iron.sqrt, 6)4
## 0.816055 -0.651667
## ns(Serum.Iron.sqrt, 6)5 ns(Serum.Iron.sqrt, 6)6
## 2.831390 2.525582
## ns(Serum.Magnesium, 3)1 ns(Serum.Magnesium, 3)2
## -0.947189 0.441262
## ns(Serum.Magnesium, 3)3 Serum.Protein
## 0.803462 0.218432
## Sexfemale ns(Systolic.BP, 4)1
## -1.060542 2.647317
## ns(Systolic.BP, 4)2 ns(Systolic.BP, 4)3
## 4.818476 7.258152
## ns(Systolic.BP, 4)4 TIBC
## 5.949845 0.001558
## White.blood.cells.log ns(BMI.log, 4)1
## 0.777769 -1.816743
## ns(BMI.log, 4)2 ns(BMI.log, 4)3
## -0.985700 -2.922636
## ns(BMI.log, 4)4 ns(Pulse.pressure.sqrt, 3)1
## -1.131883 -3.388004
## ns(Pulse.pressure.sqrt, 3)2 ns(Pulse.pressure.sqrt, 3)3
## -5.741128 -4.075939
##
## Degrees of Freedom: 6988 Total (i.e. Null); 6947 Residual
## Null Deviance: 6088
## Residual Deviance: 4362 AIC: 4446
anova( fit.multi.glm_ts, fit.multi.glm, test="LRT")## Analysis of Deviance Table
##
## Model 1: death ~ Age + ns(Diastolic.BP, 4) + ns(Poverty.index.sqrt, 3) +
## Race + ns(Red.blood.cells.log, 4) + Sedimentation.rate.log +
## Serum.Albumin + Serum.Cholesterol.log + ns(Serum.Iron.sqrt,
## 6) + ns(Serum.Magnesium, 3) + Serum.Protein + Sex + ns(Systolic.BP,
## 4) + TIBC + White.blood.cells.log + ns(BMI.log, 4) + ns(Pulse.pressure.sqrt,
## 3)
## Model 2: death ~ Age + ns(Diastolic.BP, 4) + ns(Poverty.index.sqrt, 4) +
## Race + ns(Red.blood.cells.log, 3) + Sedimentation.rate.log +
## Serum.Albumin + Serum.Cholesterol.log + ns(Serum.Iron.sqrt,
## 6) + ns(Serum.Magnesium, 3) + ns(Serum.Protein, 3) + Sex +
## ns(Systolic.BP, 4) + TIBC + TS + White.blood.cells.log +
## ns(BMI.log, 4) + ns(Pulse.pressure.sqrt, 3)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 6947 4362.1
## 2 6944 4357.3 3 4.8264 0.185
# Likelihood-ratio test: H_0 the simpler model is adequate (keep it); H_a the more complex model fits significantly better.
# If the resulting p-value is sufficiently low (usually less than 0.05), we conclude that the more complex model
# is significantly better than the simpler model.
According to the tests (non-significant likelihood-ratio test, smaller AIC) we should keep the simpler model.
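What `anova(..., test="LRT")` computes can be sketched by hand on simulated data: the LRT statistic is the drop in deviance between the nested models, referred to a chi-square distribution with df equal to the difference in parameters.

```r
# Sketch: the LRT statistic is the drop in deviance between nested models,
# compared to a chi-square with df equal to the difference in parameters.
set.seed(4)
n  <- 800
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.5 * x1))   # x2 truly has no effect

small <- glm(y ~ x1,      family = binomial)
big   <- glm(y ~ x1 + x2, family = binomial)

lr <- deviance(small) - deviance(big)
p  <- pchisq(lr, df = 1, lower.tail = FALSE)
c(LR = lr, p.value = p)   # matches anova(small, big, test = "LRT")
```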
Considering the ANOVA results, the reduction in AIC, the VIF values, and the statistical significance of the coefficients, I ended up with the following model: the full model without Systolic.BP, TS, and Race.
fit.multi.glm_res<-glm(death ~ Age+ns(Diastolic.BP,4)+ns(Poverty.index.sqrt,3)+ns(Red.blood.cells.log,4)+
Sedimentation.rate.log+Serum.Albumin+Serum.Cholesterol.log+ ns(Serum.Iron.sqrt,6)+ns(Serum.Magnesium,3)+ns(Serum.Protein,3)+
Sex+TIBC+White.blood.cells.log+ns(BMI.log,4)+ns(Pulse.pressure.sqrt,3) ,family = binomial, data = survey_mice)
# NB: this repeats the earlier comparison; the reduced model would be tested with anova(fit.multi.glm_res, fit.multi.glm_ts, test="LRT")
anova( fit.multi.glm_ts, fit.multi.glm, test="LRT")## Analysis of Deviance Table
##
## Model 1: death ~ Age + ns(Diastolic.BP, 4) + ns(Poverty.index.sqrt, 3) +
## Race + ns(Red.blood.cells.log, 4) + Sedimentation.rate.log +
## Serum.Albumin + Serum.Cholesterol.log + ns(Serum.Iron.sqrt,
## 6) + ns(Serum.Magnesium, 3) + Serum.Protein + Sex + ns(Systolic.BP,
## 4) + TIBC + White.blood.cells.log + ns(BMI.log, 4) + ns(Pulse.pressure.sqrt,
## 3)
## Model 2: death ~ Age + ns(Diastolic.BP, 4) + ns(Poverty.index.sqrt, 4) +
## Race + ns(Red.blood.cells.log, 3) + Sedimentation.rate.log +
## Serum.Albumin + Serum.Cholesterol.log + ns(Serum.Iron.sqrt,
## 6) + ns(Serum.Magnesium, 3) + ns(Serum.Protein, 3) + Sex +
## ns(Systolic.BP, 4) + TIBC + TS + White.blood.cells.log +
## ns(BMI.log, 4) + ns(Pulse.pressure.sqrt, 3)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 6947 4362.1
## 2 6944 4357.3 3 4.8264 0.185
fit.multi.lrm_res<-lrm(death ~ Age+Poverty.index.sqrt+rcs(Red.blood.cells.log,4)+
Sedimentation.rate.log+Serum.Albumin+Serum.Cholesterol.log+ rcs(Serum.Iron.sqrt,6)+I(Serum.Magnesium^2)+Serum.Protein+
Sex+TIBC+White.blood.cells.log+rcs(BMI.log,3)+Pulse.pressure.sqrt^2 ,data = survey_mice, y=T, x=T) # NB: inside a formula, ^2 needs I(); as written, Pulse.pressure.sqrt enters linearly
summary(fit.multi.lrm_res)## Effects Response : death
##
## Factor Low High Diff. Effect S.E.
## Age 35.0000 66.0000 31.00000 2.737500 0.124690
## Odds Ratio 35.0000 66.0000 31.00000 15.448000 NA
## Poverty.index.sqrt 11.7050 19.3130 7.60850 -0.235200 0.049413
## Odds Ratio 11.7050 19.3130 7.60850 0.790410 NA
## Red.blood.cells.log 1.4816 1.6194 0.13778 -0.055147 0.109740
## Odds Ratio 1.4816 1.6194 0.13778 0.946350 NA
## Sedimentation.rate.log 1.9459 3.0910 1.14510 0.253900 0.063756
## Odds Ratio 1.9459 3.0910 1.14510 1.289000 NA
## Serum.Albumin 4.2000 4.6000 0.40000 -0.245050 0.057647
## Odds Ratio 4.2000 4.6000 0.40000 0.782670 NA
## Serum.Cholesterol.log 5.2364 5.5215 0.28502 -0.037934 0.054452
## Odds Ratio 5.2364 5.5215 0.28502 0.962780 NA
## Serum.Iron.sqrt 8.6603 11.0000 2.33970 -0.219010 0.102470
## Odds Ratio 8.6603 11.0000 2.33970 0.803310 NA
## Serum.Magnesium 1.5900 1.7700 0.18000 -0.144290 0.046492
## Odds Ratio 1.5900 1.7700 0.18000 0.865640 NA
## Serum.Protein 6.8000 7.4000 0.60000 0.157000 0.052595
## Odds Ratio 6.8000 7.4000 0.60000 1.170000 NA
## TIBC 323.0000 396.0000 73.00000 0.117080 0.055530
## Odds Ratio 323.0000 396.0000 73.00000 1.124200 NA
## White.blood.cells.log 1.7918 2.1518 0.36000 0.274240 0.052523
## Odds Ratio 1.7918 2.1518 0.36000 1.315500 NA
## BMI.log 3.0974 3.3433 0.24586 -0.190410 0.054730
## Odds Ratio 3.0974 3.3433 0.24586 0.826620 NA
## Pulse.pressure.sqrt 6.3246 7.7460 1.42140 0.155690 0.047212
## Odds Ratio 6.3246 7.7460 1.42140 1.168500 NA
## Sex - male:female 2.0000 1.0000 NA 1.074500 0.089690
## Odds Ratio 2.0000 1.0000 NA 2.928700 NA
## Lower 0.95 Upper 0.95
## 2.4931000 2.981900
## 12.0990000 19.725000
## -0.3320500 -0.138360
## 0.7174500 0.870790
## -0.2702200 0.159930
## 0.7632100 1.173400
## 0.1289400 0.378860
## 1.1376000 1.460600
## -0.3580300 -0.132060
## 0.6990500 0.876290
## -0.1446600 0.068790
## 0.8653200 1.071200
## -0.4198400 -0.018178
## 0.6571500 0.981990
## -0.2354100 -0.053165
## 0.7902500 0.948220
## 0.0539180 0.260090
## 1.0554000 1.297000
## 0.0082415 0.225920
## 1.0083000 1.253500
## 0.1713000 0.377190
## 1.1868000 1.458200
## -0.2976800 -0.083139
## 0.7425400 0.920220
## 0.0631560 0.248220
## 1.0652000 1.281700
## 0.8987600 1.250300
## 2.4566000 3.491500
fit.multi.lrm_res## Logistic Regression Model
##
## lrm(formula = death ~ Age + Poverty.index.sqrt + rcs(Red.blood.cells.log,
## 4) + Sedimentation.rate.log + Serum.Albumin + Serum.Cholesterol.log +
## rcs(Serum.Iron.sqrt, 6) + I(Serum.Magnesium^2) + Serum.Protein +
## Sex + TIBC + White.blood.cells.log + rcs(BMI.log, 3) + Pulse.pressure.sqrt^2,
## data = survey_mice, x = T, y = T)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 6989 LR chi2 1680.79 R2 0.368 C 0.855
## 0 5888 d.f. 21 g 1.953 Dxy 0.709
## 1 1101 Pr(> chi2) <0.0001 gr 7.047 gamma 0.710
## max |deriv| 3e-08 gp 0.188 tau-a 0.188
## Brier 0.098
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 2.6977 2.3308 1.16 0.2471
## Age 0.0883 0.0040 21.95 <0.0001
## Poverty.index.sqrt -0.0309 0.0065 -4.76 <0.0001
## Red.blood.cells.log -2.5337 1.0658 -2.38 0.0174
## Red.blood.cells.log' 2.0533 3.1998 0.64 0.5211
## Red.blood.cells.log'' 6.3288 14.1235 0.45 0.6541
## Sedimentation.rate.log 0.2217 0.0557 3.98 <0.0001
## Serum.Albumin -0.6126 0.1441 -4.25 <0.0001
## Serum.Cholesterol.log -0.1331 0.1910 -0.70 0.4860
## Serum.Iron.sqrt 0.2151 0.1264 1.70 0.0888
## Serum.Iron.sqrt' -3.5726 1.0361 -3.45 0.0006
## Serum.Iron.sqrt'' 24.3320 6.9831 3.48 0.0005
## Serum.Iron.sqrt''' -45.7553 14.4523 -3.17 0.0015
## Serum.Iron.sqrt'''' 35.4162 13.2701 2.67 0.0076
## Serum.Magnesium -0.2386 0.0769 -3.10 0.0019
## Serum.Protein 0.2617 0.0877 2.99 0.0028
## Sex=female -1.0745 0.0897 -11.98 <0.0001
## TIBC 0.0016 0.0008 2.11 0.0350
## White.blood.cells.log 0.7618 0.1459 5.22 <0.0001
## BMI.log -2.3257 0.4892 -4.75 <0.0001
## BMI.log' 2.3391 0.5656 4.14 <0.0001
## Pulse.pressure.sqrt 0.1095 0.0332 3.30 0.0010
##
anova(fit.multi.lrm_res)## Wald Statistics Response: death
##
## Factor Chi-Square d.f. P
## Age 482.01 1 <.0001
## Poverty.index.sqrt 22.66 1 <.0001
## Red.blood.cells.log 21.16 3 0.0001
## Nonlinear 21.11 2 <.0001
## Sedimentation.rate.log 15.86 1 0.0001
## Serum.Albumin 18.07 1 <.0001
## Serum.Cholesterol.log 0.49 1 0.4860
## Serum.Iron.sqrt 21.63 5 0.0006
## Nonlinear 18.43 4 0.0010
## Serum.Magnesium 9.63 1 0.0019
## Serum.Protein 8.91 1 0.0028
## Sex 143.54 1 <.0001
## TIBC 4.45 1 0.0350
## White.blood.cells.log 27.26 1 <.0001
## BMI.log 22.71 2 <.0001
## Nonlinear 17.10 1 <.0001
## Pulse.pressure.sqrt 10.87 1 0.0010
## TOTAL NONLINEAR 58.00 7 <.0001
## TOTAL 1014.11 21 <.0001
Backward algorithm: fastbw()
#### BACKWARD ELIMINATION ALGORITHM
fastbw(fit.multi.lrm)##
## Deleted Chi-Sq d.f. P Residual d.f. P AIC
## Race 1.58 2 0.4549 1.58 2 0.4549 -2.42
## Serum.Cholesterol.log 0.32 1 0.5699 1.90 3 0.5938 -4.10
## TS 0.86 1 0.3533 2.76 4 0.5988 -5.24
## TIBC 3.68 1 0.0550 6.44 5 0.2655 -3.56
##
## Approximate Estimates after Deleting Factors
##
## Coef S.E. Wald Z P
## Intercept 10.43670 3.199911 3.26156 1.108e-03
## Age 0.08741 0.004001 21.84560 0.000e+00
## Diastolic.BP -0.02094 0.013008 -1.60996 1.074e-01
## Diastolic.BP' 0.06592 0.043402 1.51892 1.288e-01
## Diastolic.BP'' -0.14817 0.135936 -1.09002 2.757e-01
## Poverty.index.sqrt -0.01854 0.028792 -0.64400 5.196e-01
## Poverty.index.sqrt' -0.14524 0.137256 -1.05817 2.900e-01
## Poverty.index.sqrt'' 0.52641 0.406531 1.29489 1.954e-01
## Red.blood.cells.log -2.69156 0.749768 -3.58986 3.309e-04
## Red.blood.cells.log' 3.70029 0.873306 4.23711 2.264e-05
## Sedimentation.rate.log 0.20586 0.055632 3.70039 2.153e-04
## Serum.Albumin -0.51157 0.143972 -3.55327 3.805e-04
## Serum.Iron.sqrt 0.14783 0.125163 1.18106 2.376e-01
## Serum.Iron.sqrt' -3.15297 1.035712 -3.04426 2.333e-03
## Serum.Iron.sqrt'' 22.46413 7.003681 3.20747 1.339e-03
## Serum.Iron.sqrt''' -43.08006 14.515197 -2.96793 2.998e-03
## Serum.Iron.sqrt'''' 33.41291 13.336965 2.50529 1.224e-02
## Serum.Magnesium -1.56088 0.504763 -3.09231 1.986e-03
## Serum.Magnesium' 0.92668 0.567957 1.63160 1.028e-01
## Serum.Protein -0.01414 0.179638 -0.07871 9.373e-01
## Serum.Protein' 0.27909 0.181994 1.53353 1.251e-01
## Sex=female -1.04347 0.087320 -11.95003 0.000e+00
## White.blood.cells.log 0.79061 0.146870 5.38304 7.324e-08
## BMI.log -2.66486 0.807979 -3.29819 9.731e-04
## BMI.log' 4.43223 2.747111 1.61341 1.067e-01
## BMI.log'' -9.27341 9.013034 -1.02889 3.035e-01
## Pulse.pressure.sqrt -0.13758 0.168511 -0.81645 4.142e-01
## Pulse.pressure.sqrt' 0.37838 0.597317 0.63346 5.264e-01
## Pulse.pressure.sqrt'' -0.56603 1.769770 -0.31983 7.491e-01
##
## Factors in Final Model
##
## [1] Age Diastolic.BP Poverty.index.sqrt
## [4] Red.blood.cells.log Sedimentation.rate.log Serum.Albumin
## [7] Serum.Iron.sqrt Serum.Magnesium Serum.Protein
## [10] Sex White.blood.cells.log BMI.log
## [13] Pulse.pressure.sqrt
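For comparison with `fastbw()`, here is a self-contained sketch (simulated data, not the survey) of the same idea using base R's `step()`: start from the full model and drop terms while the AIC improves.

```r
# Sketch: AIC-based backward elimination with step(), on simulated data
set.seed(5)
n  <- 600
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1 + 0.8 * x2))  # x3 is pure noise

full <- glm(y ~ x1 + x2 + x3, family = binomial)
red  <- step(full, direction = "backward", trace = FALSE)
formula(red)   # the strong predictors x1 and x2 are retained
```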
# After several tests, reducing the non-linearity of Poverty and Iron, we obtain a better model
fit.bw1<- lrm(death~Age+ Poverty.index.sqrt+rcs(Red.blood.cells.log,3)+
Sedimentation.rate.log+Serum.Albumin+rcs(Serum.Magnesium,3)+Systolic.BP+Serum.Protein+Sex+rcs(BMI.log,3)+
rcs(Serum.Iron.sqrt,6)+White.blood.cells.log, data = survey_mice, y = TRUE, x=TRUE)
fit.bw1## Logistic Regression Model
##
## lrm(formula = death ~ Age + Poverty.index.sqrt + rcs(Red.blood.cells.log,
## 3) + Sedimentation.rate.log + Serum.Albumin + rcs(Serum.Magnesium,
## 3) + Systolic.BP + Serum.Protein + Sex + rcs(BMI.log, 3) +
## rcs(Serum.Iron.sqrt, 6) + White.blood.cells.log, data = survey_mice,
## x = TRUE, y = TRUE)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 6989 LR chi2 1690.00 R2 0.369 C 0.856
## 0 5888 d.f. 19 g 1.980 Dxy 0.712
## 1 1101 Pr(> chi2) <0.0001 gr 7.240 gamma 0.712
## max |deriv| 1e-08 gp 0.189 tau-a 0.189
## Brier 0.098
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 5.4151 2.1170 2.56 0.0105
## Age 0.0862 0.0038 22.58 <0.0001
## Poverty.index.sqrt -0.0295 0.0065 -4.55 <0.0001
## Red.blood.cells.log -2.8444 0.7427 -3.83 0.0001
## Red.blood.cells.log' 3.8544 0.8660 4.45 <0.0001
## Sedimentation.rate.log 0.2081 0.0553 3.76 0.0002
## Serum.Albumin -0.5789 0.1418 -4.08 <0.0001
## Serum.Magnesium -1.6249 0.5016 -3.24 0.0012
## Serum.Magnesium' 1.0067 0.5631 1.79 0.0738
## Systolic.BP 0.0077 0.0016 4.70 <0.0001
## Serum.Protein 0.2456 0.0878 2.80 0.0052
## Sex=female -1.0500 0.0864 -12.15 <0.0001
## BMI.log -2.4128 0.4862 -4.96 <0.0001
## BMI.log' 2.2768 0.5638 4.04 <0.0001
## Serum.Iron.sqrt 0.1596 0.1261 1.27 0.2054
## Serum.Iron.sqrt' -3.2556 1.0373 -3.14 0.0017
## Serum.Iron.sqrt'' 23.0333 6.9910 3.29 0.0010
## Serum.Iron.sqrt''' -44.1589 14.4662 -3.05 0.0023
## Serum.Iron.sqrt'''' 34.4816 13.2855 2.60 0.0094
## White.blood.cells.log 0.7798 0.1460 5.34 <0.0001
##
Univariable filtering
# Consider the predictors statistically significant (p < 0.05) in the univariable models
predictor <- c("Age", "Diastolic.BP", "Poverty.index.sqrt",
"Sedimentation.rate.log","Serum.Albumin", "Serum.Cholesterol.log", "Serum.Iron.sqrt","Serum.Magnesium",
"Sex", "Systolic.BP","White.blood.cells.log", "Pulse.pressure.sqrt")
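The filtering step itself can be sketched on simulated data (names `x1`–`x3` are illustrative): fit one logistic model per candidate and keep the predictors with p < 0.05.

```r
# Sketch of univariable filtering on simulated data: one logistic model per
# candidate predictor, keep those with p < 0.05
set.seed(6)
n   <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * dat$x1))   # only x1 truly matters

pvals <- sapply(c("x1", "x2", "x3"), function(v) {
  f <- glm(reformulate(v, response = "y"), family = binomial, data = dat)
  summary(f)$coefficients[v, "Pr(>|z|)"]
})
names(pvals)[pvals < 0.05]   # predictors passing the univariable filter
```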
fit.univ<-lrm(death ~ rcs(Diastolic.BP,4)+rcs(Systolic.BP,4)+Age+rcs(Poverty.index.sqrt,3)+
Sedimentation.rate.log+Serum.Albumin+Serum.Cholesterol.log+ rcs(Serum.Iron.sqrt,6)+ rcs(Serum.Magnesium,3)+
Sex+White.blood.cells.log+rcs(Pulse.pressure.sqrt,3) ,data = survey_mice, y=T, x=T)
fit.univ## Logistic Regression Model
##
## lrm(formula = death ~ rcs(Diastolic.BP, 4) + rcs(Systolic.BP,
## 4) + Age + rcs(Poverty.index.sqrt, 3) + Sedimentation.rate.log +
## Serum.Albumin + Serum.Cholesterol.log + rcs(Serum.Iron.sqrt,
## 6) + rcs(Serum.Magnesium, 3) + Sex + White.blood.cells.log +
## rcs(Pulse.pressure.sqrt, 3), data = survey_mice, x = T, y = T)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 6989 LR chi2 1661.59 R2 0.364 C 0.854
## 0 5888 d.f. 23 g 1.964 Dxy 0.708
## 1 1101 Pr(> chi2) <0.0001 gr 7.125 gamma 0.708
## max |deriv| 1e-08 gp 0.188 tau-a 0.188
## Brier 0.099
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 3.0360 2.2524 1.35 0.1777
## Diastolic.BP -0.0606 0.0292 -2.08 0.0378
## Diastolic.BP' 0.0639 0.0455 1.40 0.1609
## Diastolic.BP'' -0.1075 0.1417 -0.76 0.4479
## Systolic.BP 0.0292 0.0289 1.01 0.3111
## Systolic.BP' 0.0440 0.0478 0.92 0.3567
## Systolic.BP'' -0.1277 0.1089 -1.17 0.2410
## Age 0.0870 0.0040 21.68 <0.0001
## Poverty.index.sqrt -0.0611 0.0157 -3.90 <0.0001
## Poverty.index.sqrt' 0.0400 0.0191 2.10 0.0360
## Sedimentation.rate.log 0.2188 0.0505 4.33 <0.0001
## Serum.Albumin -0.4193 0.1267 -3.31 0.0009
## Serum.Cholesterol.log -0.2179 0.1886 -1.16 0.2481
## Serum.Iron.sqrt 0.0650 0.1233 0.53 0.5978
## Serum.Iron.sqrt' -2.7593 1.0253 -2.69 0.0071
## Serum.Iron.sqrt'' 21.0484 6.9458 3.03 0.0024
## Serum.Iron.sqrt''' -42.0525 14.4064 -2.92 0.0035
## Serum.Iron.sqrt'''' 33.9359 13.2416 2.56 0.0104
## Serum.Magnesium -1.7224 0.5002 -3.44 0.0006
## Serum.Magnesium' 1.0793 0.5618 1.92 0.0547
## Sex=female -1.0430 0.0830 -12.56 <0.0001
## White.blood.cells.log 0.7274 0.1439 5.05 <0.0001
## Pulse.pressure.sqrt -0.6130 0.3326 -1.84 0.0653
## Pulse.pressure.sqrt' 0.2757 0.1556 1.77 0.0763
##
fit.univ2<-lrm(death ~ rcs(Diastolic.BP,4)+Age+rcs(Poverty.index.sqrt,3)+
Sedimentation.rate.log+Serum.Albumin+Serum.Cholesterol.log+ rcs(Serum.Iron.sqrt,6)+ rcs(Serum.Magnesium,3)+
Sex+White.blood.cells.log ,data = survey_mice, y=T, x=T)
fit.univ2## Logistic Regression Model
##
## lrm(formula = death ~ rcs(Diastolic.BP, 4) + Age + rcs(Poverty.index.sqrt,
## 3) + Sedimentation.rate.log + Serum.Albumin + Serum.Cholesterol.log +
## rcs(Serum.Iron.sqrt, 6) + rcs(Serum.Magnesium, 3) + Sex +
## White.blood.cells.log, data = survey_mice, x = T, y = T)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 6989 LR chi2 1641.47 R2 0.360 C 0.852
## 0 5888 d.f. 18 g 1.969 Dxy 0.705
## 1 1101 Pr(> chi2) <0.0001 gr 7.160 gamma 0.705
## max |deriv| 8e-09 gp 0.187 tau-a 0.187
## Brier 0.100
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -0.0632 1.7779 -0.04 0.9716
## Diastolic.BP -0.0290 0.0127 -2.29 0.0217
## Diastolic.BP' 0.0829 0.0426 1.95 0.0515
## Diastolic.BP'' -0.1826 0.1335 -1.37 0.1714
## Age 0.0916 0.0037 24.80 <0.0001
## Poverty.index.sqrt -0.0622 0.0156 -3.99 <0.0001
## Poverty.index.sqrt' 0.0396 0.0190 2.09 0.0370
## Sedimentation.rate.log 0.2299 0.0504 4.57 <0.0001
## Serum.Albumin -0.3835 0.1261 -3.04 0.0024
## Serum.Cholesterol.log -0.2256 0.1876 -1.20 0.2292
## Serum.Iron.sqrt 0.0723 0.1234 0.59 0.5583
## Serum.Iron.sqrt' -2.8645 1.0242 -2.80 0.0052
## Serum.Iron.sqrt'' 21.6862 6.9312 3.13 0.0018
## Serum.Iron.sqrt''' -43.1089 14.3669 -3.00 0.0027
## Serum.Iron.sqrt'''' 34.6205 13.2011 2.62 0.0087
## Serum.Magnesium -1.7665 0.4986 -3.54 0.0004
## Serum.Magnesium' 1.0658 0.5606 1.90 0.0573
## Sex=female -1.0137 0.0819 -12.38 <0.0001
## White.blood.cells.log 0.7392 0.1433 5.16 <0.0001
##
Forward algorithm
nothing <- glm(death~1, family=binomial, data = survey_mice)
forwards <- step(nothing, scope=list(lower = formula(nothing),upper = formula(fit.multi.glm)),
direction ="forward",trace=FALSE )
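step() with direction = "forward" greedily adds, at each step, the candidate term whose inclusion most improves the AIC, and stops when no remaining candidate helps. A minimal sketch of that greedy loop (Python for illustration; `score` is a hypothetical stand-in for fitting a model and returning its AIC):

```python
# Hedged sketch of greedy forward selection: repeatedly add the candidate
# whose inclusion gives the lowest score (think AIC); stop when no
# remaining candidate improves it.
def forward_select(candidates, score):
    selected = []
    best = score(selected)
    while True:
        step_best, step_var = best, None
        for c in candidates:
            if c in selected:
                continue
            s = score(selected + [c])
            if s < step_best:          # strictly better than current best
                step_best, step_var = s, c
        if step_var is None:           # no candidate improves the score
            return selected
        selected.append(step_var)
        best = step_best

# Toy score: 'a' and 'b' truly reduce error; every added term pays a penalty.
useful = {"a", "b"}
score = lambda S: 10 - 3 * len(useful & set(S)) + 1 * len(S)
```

With this toy score the loop selects exactly the two useful variables and rejects the noise variable, mirroring how step() trades fit against the AIC complexity penalty.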
#lrm model to compare with other
fit.for<-lrm(death ~ Age + Sex + rcs(Poverty.index.sqrt, 4) + Sedimentation.rate.log +
White.blood.cells.log + rcs(Red.blood.cells.log, 3) + rcs(Systolic.BP,4) + rcs(BMI.log, 4) +
rcs(Serum.Magnesium, 3) + rcs(Serum.Iron.sqrt,
6) + Serum.Albumin + rcs(Serum.Protein, 3) + TIBC, data = survey_mice, x=T, y=T)
fit.for
## Logistic Regression Model
##
## lrm(formula = death ~ Age + Sex + rcs(Poverty.index.sqrt, 4) +
## Sedimentation.rate.log + White.blood.cells.log + rcs(Red.blood.cells.log,
## 3) + rcs(Systolic.BP, 4) + rcs(BMI.log, 4) + rcs(Serum.Magnesium,
## 3) + rcs(Serum.Iron.sqrt, 6) + Serum.Albumin + rcs(Serum.Protein,
## 3) + TIBC, data = survey_mice, x = T, y = T)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 6989 LR chi2 1704.36 R2 0.372 C 0.857
## 0 5888 d.f. 26 g 1.978 Dxy 0.714
## 1 1101 Pr(> chi2) <0.0001 gr 7.228 gamma 0.714
## max |deriv| 4e-08 gp 0.189 tau-a 0.190
## Brier 0.098
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 9.3347 3.1291 2.98 0.0029
## Age 0.0872 0.0040 22.04 <0.0001
## Sex=female -1.0849 0.0881 -12.32 <0.0001
## Poverty.index.sqrt -0.0213 0.0287 -0.74 0.4590
## Poverty.index.sqrt' -0.1375 0.1371 -1.00 0.3156
## Poverty.index.sqrt'' 0.5080 0.4061 1.25 0.2110
## Sedimentation.rate.log 0.2078 0.0556 3.74 0.0002
## White.blood.cells.log 0.7904 0.1463 5.40 <0.0001
## Red.blood.cells.log -2.7339 0.7465 -3.66 0.0003
## Red.blood.cells.log' 3.6870 0.8722 4.23 <0.0001
## Systolic.BP -0.0143 0.0115 -1.24 0.2142
## Systolic.BP' 0.0721 0.0451 1.60 0.1100
## Systolic.BP'' -0.1523 0.1041 -1.46 0.1432
## BMI.log -2.6905 0.8082 -3.33 0.0009
## BMI.log' 4.1543 2.7439 1.51 0.1300
## BMI.log'' -7.8545 8.9974 -0.87 0.3827
## Serum.Magnesium -1.5115 0.5041 -3.00 0.0027
## Serum.Magnesium' 0.8885 0.5665 1.57 0.1168
## Serum.Iron.sqrt 0.1824 0.1267 1.44 0.1498
## Serum.Iron.sqrt' -3.3076 1.0397 -3.18 0.0015
## Serum.Iron.sqrt'' 22.9178 7.0056 3.27 0.0011
## Serum.Iron.sqrt''' -43.3547 14.4986 -2.99 0.0028
## Serum.Iron.sqrt'''' 33.4097 13.3147 2.51 0.0121
## Serum.Albumin -0.5633 0.1449 -3.89 0.0001
## Serum.Protein -0.0216 0.1795 -0.12 0.9042
## Serum.Protein' 0.2821 0.1821 1.55 0.1213
## TIBC 0.0014 0.0008 1.88 0.0603
##
summary(fit.for)
## Effects Response : death
##
## Factor Low High Diff. Effect S.E.
## Age 35.0000 66.0000 31.00000 2.703500 0.122680
## Odds Ratio 35.0000 66.0000 31.00000 14.932000 NA
## Poverty.index.sqrt 11.7050 19.3130 7.60850 -0.391760 0.108140
## Odds Ratio 11.7050 19.3130 7.60850 0.675870 NA
## Sedimentation.rate.log 1.9459 3.0910 1.14510 0.237960 0.063618
## Odds Ratio 1.9459 3.0910 1.14510 1.268700 NA
## White.blood.cells.log 1.7918 2.1518 0.36000 0.284550 0.052684
## Odds Ratio 1.7918 2.1518 0.36000 1.329200 NA
## Red.blood.cells.log 1.4816 1.6194 0.13778 0.008542 0.058766
## Odds Ratio 1.4816 1.6194 0.13778 1.008600 NA
## Systolic.BP 118.0000 150.0000 32.00000 0.256800 0.108930
## Odds Ratio 118.0000 150.0000 32.00000 1.292800 NA
## BMI.log 3.0974 3.3433 0.24586 -0.159050 0.102440
## Odds Ratio 3.0974 3.3433 0.24586 0.852950 NA
## Serum.Magnesium 1.5900 1.7700 0.18000 -0.152120 0.045735
## Odds Ratio 1.5900 1.7700 0.18000 0.858890 NA
## Serum.Iron.sqrt 8.6603 11.0000 2.33970 -0.186020 0.103070
## Odds Ratio 8.6603 11.0000 2.33970 0.830260 NA
## Serum.Albumin 4.2000 4.6000 0.40000 -0.225340 0.057971
## Odds Ratio 4.2000 4.6000 0.40000 0.798250 NA
## Serum.Protein 6.8000 7.4000 0.60000 0.113990 0.054043
## Odds Ratio 6.8000 7.4000 0.60000 1.120700 NA
## TIBC 323.0000 396.0000 73.00000 0.104660 0.055709
## Odds Ratio 323.0000 396.0000 73.00000 1.110300 NA
## Sex - male:female 2.0000 1.0000 NA 1.084900 0.088062
## Odds Ratio 2.0000 1.0000 NA 2.959000 NA
## Lower 0.95 Upper 0.95
## 2.4631000 2.944000
## 11.7410000 18.991000
## -0.6037100 -0.179810
## 0.5467800 0.835430
## 0.1132700 0.362650
## 1.1199000 1.437100
## 0.1812900 0.387810
## 1.1988000 1.473700
## -0.1066400 0.123720
## 0.8988500 1.131700
## 0.0432990 0.470310
## 1.0443000 1.600500
## -0.3598400 0.041737
## 0.6977900 1.042600
## -0.2417600 -0.062482
## 0.7852500 0.939430
## -0.3880300 0.015996
## 0.6783900 1.016100
## -0.3389600 -0.111720
## 0.7125100 0.894300
## 0.0080670 0.219910
## 1.0081000 1.246000
## -0.0045309 0.213850
## 0.9954800 1.238400
## 0.9122700 1.257500
## 2.4900000 3.516500
Backward and forward
bothways <- step(nothing, list(lower=formula(nothing), upper = formula(fit.multi.glm)),
direction = "both", trace = 0)
fit.both<-(death ~ Age + Sex + Sedimentation.rate.log + rcs(Poverty.index.sqrt, 4) +
White.blood.cells.log + rcs(Systolic.BP, 4) + rcs(BMI.log, 4) + rcs(Serum.Magnesium,3)
+ rcs(Serum.Iron.sqrt, 6) + rcs(Red.blood.cells.log, 4) + Serum.Albumin +
Serum.Protein + TIBC)
#same as forward
Models’ performance
######################## Performance statistics
s <- fit.multi.lrm$stats # performance of the estimated model
round(s, digits=3) # C is the AUC; Dxy is related to C (Dxy = 2*(C-0.5)); both are discrimination measures
## Obs Max Deriv Model L.R. d.f. P C Dxy
## 6989.000 0.000 1716.335 33.000 0.000 0.858 0.716
## Gamma Tau-a R2 Brier g gr gp
## 0.716 0.190 0.374 0.098 1.986 7.287 0.190
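As the comment above notes, Somers' Dxy and the c-statistic carry the same information through Dxy = 2(C − 0.5). A quick arithmetic check against the printed values for the full model (Python, purely illustrative):

```python
# Dxy and the c-statistic (AUC) are equivalent discrimination measures:
# Dxy = 2 * (C - 0.5). For the full model the printout gives C = 0.858.
def dxy_from_c(c):
    return 2 * (c - 0.5)

print(round(dxy_from_c(0.858), 3))  # 0.716, matching the printed Dxy
```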
gamma.hat <- (s['Model L.R.'] - s['d.f.'])/s['Model L.R.'] #shrinkage coefficient
s1 <- fit.multi.lrm_res$stats # performance of the estimated model
round(s1, digits=3)
## Obs Max Deriv Model L.R. d.f. P C Dxy
## 6989.000 0.000 1680.786 21.000 0.000 0.855 0.709
## Gamma Tau-a R2 Brier g gr gp
## 0.710 0.188 0.368 0.098 1.953 7.047 0.188
gamma.hat1 <- (s1['Model L.R.'] - s1['d.f.'])/s1['Model L.R.'] #shrinkage coefficient
#For both models C is about 0.86; in the simpler model Dxy is slightly worse (0.716 vs
# 0.709), but the shrinkage coefficient increases (from 0.981 to 0.988)
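The heuristic shrinkage estimate computed here is \(\hat\gamma = (\chi^2_{LR} - df)/\chi^2_{LR}\); values near 1 suggest little expected overfitting. A quick check of the full model's value from the printed stats (Python, illustrative arithmetic):

```python
# Heuristic shrinkage factor gamma_hat = (LR chi2 - df) / LR chi2.
# Full model printout: LR chi2 = 1716.335 on df = 33.
def shrinkage(lr_chi2, df):
    return (lr_chi2 - df) / lr_chi2

print(round(shrinkage(1716.335, 33), 4))  # ~0.9808
```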
s2 <- fit.bw1$stats # performance of the estimated model
round(s2, digits=3)
## Obs Max Deriv Model L.R. d.f. P C Dxy
## 6989.000 0.000 1689.999 19.000 0.000 0.856 0.712
## Gamma Tau-a R2 Brier g gr gp
## 0.712 0.189 0.369 0.098 1.980 7.240 0.189
gamma.hat2 <- (s2['Model L.R.'] - s2['d.f.'])/s2['Model L.R.'] #shrinkage coefficient
s3 <- fit.for$stats # performance of the estimated model
round(s3, digits=3)
## Obs Max Deriv Model L.R. d.f. P C Dxy
## 6989.000 0.000 1704.365 26.000 0.000 0.857 0.714
## Gamma Tau-a R2 Brier g gr gp
## 0.714 0.190 0.372 0.098 1.978 7.228 0.189
gamma.hat3 <- (s3['Model L.R.'] - s3['d.f.'])/s3['Model L.R.'] #shrinkage coefficient
s4 <- fit.univ$stats # performance of the estimated model
round(s4, digits=3)
## Obs Max Deriv Model L.R. d.f. P C Dxy
## 6989.000 0.000 1661.589 23.000 0.000 0.854 0.708
## Gamma Tau-a R2 Brier g gr gp
## 0.708 0.188 0.364 0.099 1.964 7.125 0.188
gamma.hat4 <- (s4['Model L.R.'] - s4['d.f.'])/s4['Model L.R.'] #shrinkage coefficient
s5 <- fit.univ2$stats # performance of the estimated model
round(s5, digits=3)
## Obs Max Deriv Model L.R. d.f. P C Dxy
## 6989.000 0.000 1641.467 18.000 0.000 0.852 0.705
## Gamma Tau-a R2 Brier g gr gp
## 0.705 0.187 0.360 0.100 1.969 7.160 0.187
gamma.hat5 <- (s5['Model L.R.'] - s5['d.f.'])/s5['Model L.R.'] #shrinkage coefficient
Question 5: Model Discrimination and Evaluation
# Let us now use the bootstrap to further evaluate calibration and discrimination of this reduced model:
## Full model
fmult <- update(fit.multi.lrm , x=TRUE , y=TRUE)
vmult <- validate (fmult, B=200)
#vmult # Dxy (0,7167 to 0.7049) (opt 0.0118) Slope (1-0.9629) (opt 0.0371)
cal.full <- calibrate(fmult, B=200)
## Hand bw model
vhand <- validate (fit.multi.lrm_res, B=200)
#vhand # Dxy 0.7093-0.7032 (opt 0.0061), slope 1 -0.9798 (opt 0.0202)
cal.hbw <- calibrate(fit.multi.lrm_res, B=200)
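validate() estimates optimism by refitting the model in bootstrap samples: for each resample the index is computed on the bootstrap fit and again when that fit is applied to the original data; the average difference is the optimism, and the corrected index is apparent − optimism. Illustrative Python arithmetic using the full-model Dxy values quoted in the comments:

```python
# Bootstrap optimism correction as reported by rms::validate():
# corrected = apparent - mean(train_index - test_index over B resamples).
def optimism_corrected(apparent, optimism):
    return apparent - optimism

# Full model: apparent Dxy 0.7167, estimated optimism 0.0118
print(round(optimism_corrected(0.7167, 0.0118), 4))  # 0.7049
```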
## Bw model
#fhand <- update(fit.multi.lrm_res, x=TRUE , y=TRUE)
vbw <- validate (fit.bw1, B=200)
#vbw # Dxy(0.7136-0.7070) opt(0.0066) slope(1-0.9791) (opt 0.0209)
cal.bw <- calibrate(fit.bw1, B=200)
## forward model
ffor <- update(fit.for, x=TRUE , y=TRUE)
vfor <- validate (ffor, B=200)
#vfor # Dxy(0.7141-0.7063) opt(0.0077) slope(1-0.9737) (opt 0.0263)
cal.for <- calibrate(ffor, B=200)
## univariate model
funiv <- update(fit.univ, x=TRUE , y=TRUE)
vuniv <- validate (funiv, B=200)
#vuniv # Dxy( 0.7078-0.7000) opt(0.0078) slope(1-0.9753) (opt 0.0247)
cal.univ <- calibrate(funiv, B=200)
## univariate reduced model
funiv2 <- update(fit.univ2, x=TRUE , y=TRUE)
vuniv2 <- validate (funiv2, B=200)
#vuniv2 # Dxy(0.7048- 0.6999) opt(0.0049) slope(1-0.9838) (opt 0.0162)
cal.univ2 <- calibrate(funiv2, B=200)
Discrimination: AUC
# predicted probability by model
full.pred <- predict(fit.multi.lrm, type="fitted")
red.pred <- predict(fit.multi.lrm_res, type="fitted")
bw.pred <- predict(fit.bw1, type="fitted")
for.pred <- predict(fit.for, type="fitted")
univ.pred<-predict(fit.univ, type="fitted")
univ.pred2<-predict(fit.univ2, type = "fitted")
dati.pp <- data.frame(survey_mice, full.pred, red.pred, bw.pred, for.pred, univ.pred, univ.pred2)
dati.pp$Race <- as.numeric(dati.pp$Race)
dati.pp$Sex <- as.numeric(dati.pp$Sex)
dati.pp$death <- as.numeric(dati.pp$death)
dati.pp$death <- dati.pp$death -1
###### compare FULL model and BACKWARD BY HAND model
dati.pp<-select(dati.pp, -c("systolic_test", "status"))
roc.comp <- roc(death ~ full.pred+red.pred, data = dati.pp)
summary.roc(roc.comp)
## Call:
## roc(formula = death ~ full.pred + red.pred, data = dati.pp)
## Error Method: delong
## One-sided test (AUC > 0.5)
## Markers:
## ROC AUC Std. Error z value Pr(>z) lower.95 upper.95
## full.pred 0.8580 0.0056 63.48 0 0.8470 0.8691
## red.pred 0.8547 0.0057 61.87 0 0.8434 0.8659
## Overall Test for Equality of Areas:
## Chi-square( 1 df ): 10.5167 P-value: 0.0012
# there is evidence that the two AUCs are statistically different.
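The AUC compared by the DeLong test is a concordance probability: the chance that a randomly chosen event receives a higher predicted risk than a randomly chosen non-event. A minimal sketch of that pairwise definition (Python, toy data, not the DeLong variance machinery):

```python
# AUC as a concordance probability: fraction of (event, non-event) pairs
# where the event gets the higher predicted probability (ties count 1/2).
def auc_pairwise(pred, y):
    events = [p for p, yi in zip(pred, y) if yi == 1]
    nonevents = [p for p, yi in zip(pred, y) if yi == 0]
    wins = sum((e > n) + 0.5 * (e == n) for e in events for n in nonevents)
    return wins / (len(events) * len(nonevents))

# toy check: a perfectly separating score has AUC 1
print(auc_pairwise([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
```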
par(pty="s")
model0 <- glm(death ~ full.pred, data = dati.pp, family=binomial)
lroc0 <- lroc2(model0,lty.choice=1,lwd=1,pch='.',cex=0.1,line.col="black")
model1 <- glm(death ~ red.pred, data = dati.pp, family=binomial)
plot.newlroc1 <- lroc2(model1,add=T,lty.choice=1,lwd=1,pch='.',cex=0.1,line.col="red")
legend("bottomright", legend=c("Full Model","Hand Backward"), col=c("black", "red"), lty=c(1,1), bty="n")
# Despite the significant DeLong test, the plot shows the two curves almost overlapping.
#### compare FULL model and BACKWARD BY ALGORITHM
roc.comp2 <- roc(death ~ full.pred+bw.pred, data = dati.pp)
summary.roc(roc.comp2) # little evidence of difference between the 2 curves; the full model is slightly more discriminative
## Call:
## roc(formula = death ~ full.pred + bw.pred, data = dati.pp)
## Error Method: delong
## One-sided test (AUC > 0.5)
## Markers:
## ROC AUC Std. Error z value Pr(>z) lower.95 upper.95
## full.pred 0.8580 0.0056 63.48 0 0.8470 0.8691
## bw.pred 0.8559 0.0057 62.51 0 0.8447 0.8670
## Overall Test for Equality of Areas:
## Chi-square( 1 df ): 6.4885 P-value: 0.0109
# however the two curves are comparable,
# but the full model is less well calibrated
par(pty="s")
lroc0 <- lroc2(model0,lty.choice=1,lwd=1,pch='.',cex=0.1,line.col="black")
model3 <- glm(death ~ bw.pred, data = dati.pp, family=binomial)
plot.newlroc3 <- lroc2(model3,add=T,lty.choice=1,lwd=1,pch='.',cex=0.1,line.col="red")
legend("bottomright", legend=c("Full Model","Backward Model"), col=c("black", "red"), lty=c(1,1), bty="n")
#### compare BACKWARD BY HAND and BACKWARD BY ALGORITHM
roc.comp3 <- roc(death ~ red.pred+bw.pred, data = dati.pp)
summary.roc(roc.comp3) # little evidence of difference; the algorithmic backward model is slightly more discriminative
## Call:
## roc(formula = death ~ red.pred + bw.pred, data = dati.pp)
## Error Method: delong
## One-sided test (AUC > 0.5)
## Markers:
## ROC AUC Std. Error z value Pr(>z) lower.95 upper.95
## red.pred 0.8547 0.0057 61.87 0 0.8434 0.8659
## bw.pred 0.8559 0.0057 62.51 0 0.8447 0.8670
## Overall Test for Equality of Areas:
## Chi-square( 1 df ): 3.1388 P-value: 0.0765
par(pty="s")
lroc4 <- lroc2(model1,lty.choice=1,lwd=1,pch='.',cex=0.1,line.col="black")
model5 <- glm(death ~ bw.pred, data = dati.pp, family=binomial)
plot.newlroc5 <- lroc2(model5,add=T,lty.choice=1,lwd=1,pch='.',cex=0.1,line.col="red")
plot.newlroc5$auc
## [1] 0.8558896
legend("bottomright", legend=c("Bw hand","Bw algorithm"), col=c("black", "red"), lty=c(1,1), bty="n")
# comparison of 3 models
par(pty="s")
model0 <- glm(death ~ full.pred, data = dati.pp, family=binomial)
lroc0 <- lroc2(model0,lty.choice=1,lwd=1,pch='.',cex=0.1,line.col="black")
model2 <- glm(death ~ red.pred, data = dati.pp, family=binomial)
plot.newlroc22 <- lroc2(model2,add=T,lty.choice=1,lwd=1,pch='.',cex=0.1,line.col="red")
model22 <- glm(death ~ bw.pred, data = dati.pp, family=binomial)
plot.newlroc222 <- lroc2(model22,add=T,lty.choice=1,lwd=1,pch='.',cex=0.1,line.col="blue")
legend("bottomright", legend=c("Full Model","Bw by hand", "Bw by algorithm"), col=c("black", "red", "blue"), lty=c(1,1,1), bty="n")
# All three models are comparable in terms of discriminative ability, but the backward models
# are better calibrated, since they are more parsimonious.
#### compare FORWARD model and BACKWARD BY ALGORITHM
roc.comp4 <- roc(death ~ for.pred+bw.pred, data = dati.pp)
summary.roc(roc.comp4) # no evidence of difference
## Call:
## roc(formula = death ~ for.pred + bw.pred, data = dati.pp)
## Error Method: delong
## One-sided test (AUC > 0.5)
## Markers:
## ROC AUC Std. Error z value Pr(>z) lower.95 upper.95
## for.pred 0.8570 0.0057 63.00 0 0.8459 0.8681
## bw.pred 0.8559 0.0057 62.51 0 0.8447 0.8670
## Overall Test for Equality of Areas:
## Chi-square( 1 df ): 3.18 P-value: 0.0745
# comparison of 4 models
par(pty="s")
model0 <- glm(death ~ full.pred, data = dati.pp, family=binomial)
lroc0 <- lroc2(model0,lty.choice=1,lwd=1,pch='.',cex=0.1,line.col="black")
model2 <- glm(death ~ red.pred, data = dati.pp, family=binomial)
plot.newlroc22 <- lroc2(model2,add=T,lty.choice=1,lwd=1,pch='.',cex=0.1,line.col="red")
model22 <- glm(death ~ bw.pred, data = dati.pp, family=binomial)
plot.newlroc222 <- lroc2(model22,add=T,lty.choice=1,lwd=1,pch='.',cex=0.1,line.col="blue")
model23 <- glm(death ~ for.pred, data = dati.pp, family=binomial)
plot.newlroc23 <- lroc2(model23,add=T,lty.choice=1,lwd=1,pch='.',cex=0.1,line.col="green")
model5 <- glm(death ~ univ.pred, data = dati.pp, family=binomial)
plot.newlroc5 <- lroc2(model5,add=T,lty.choice=1,lwd=1,pch='.',cex=0.1,line.col="purple")
model6 <- glm(death ~ univ.pred2, data = dati.pp, family=binomial)
plot.newlroc6 <- lroc2(model6,add=T,lty.choice=1,lwd=1,pch='.',cex=0.1,line.col="orange")
legend("bottomright", legend=c("Full Model","Bw by hand", "Bw by algorithm", "Forward", "Univ", "Univ_red"), col=c("black", "red", "blue", "green", "purple", "orange"), lty=rep(1,6), bty="n")
# All six models are comparable in terms of discriminative ability, but the backward models
# are better calibrated, since they are more parsimonious.
# Backward by hand and backward by algorithm seem to be the best models: their discriminative
# ability is very close to that of the full model, their AIC is smaller and they are better calibrated.
Summary: Multivariable Model chosen
#### Calibration
par(mfrow = c(2,3))
plot(cal.full, main = "Full model") # MAE 0.004
##
## n=6989 Mean absolute error=0.004 Mean squared error=3e-05
## 0.9 Quantile of absolute error=0.009
plot(cal.hbw, main = "Bw by hand") # MAE 0.004
##
## n=6989 Mean absolute error=0.004 Mean squared error=2e-05
## 0.9 Quantile of absolute error=0.008
plot(cal.bw, main = "Bw by algorithm") # MAE 0.002
##
## n=6989 Mean absolute error=0.002 Mean squared error=1e-05
## 0.9 Quantile of absolute error=0.006
plot(cal.for, main = "Forward by algorithm") # MAE 0.003
##
## n=6989 Mean absolute error=0.003 Mean squared error=3e-05
## 0.9 Quantile of absolute error=0.008
plot(cal.univ, main = "Univariate Filtering") # MAE 0.003
##
## n=6989 Mean absolute error=0.003 Mean squared error=4e-05
## 0.9 Quantile of absolute error=0.007
plot(cal.univ2, main = "Univariate Filtering reduced") # MAE 0.003
##
## n=6989 Mean absolute error=0.003 Mean squared error=3e-05
## 0.9 Quantile of absolute error=0.006
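The mean absolute error reported under each calibration plot summarises how far the predicted probabilities sit from the (smoothed) observed event rates. A simulated sketch of the grouped version of that idea (Python; here the predictions are perfectly calibrated by construction, so the binned MAE is small):

```python
import numpy as np

# Sketch of what a calibration curve summarises: within groups of similar
# predicted risk, the mean prediction should match the observed event rate;
# the MAE averages the per-group gaps.
def calibration_pairs(p, y, bins=10):
    order = np.argsort(p)                    # sort subjects by predicted risk
    groups = np.array_split(order, bins)     # deciles of predicted risk
    return [(p[g].mean(), y[g].mean()) for g in groups]

rng = np.random.default_rng(1)
p = rng.uniform(0.01, 0.99, 5000)            # predicted probabilities
y = (rng.uniform(size=5000) < p).astype(float)  # outcomes drawn from p
mae = np.mean([abs(pm - ym) for pm, ym in calibration_pairs(p, y)])
```

With miscalibrated predictions (e.g. systematically overestimated risks) the same statistic grows, which is what the bootstrap-corrected curves above are screening for.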
col<-c("full", "hand_bw", "bw","forward", "univ", "univ_red")
#### Shrinkage
gamma.t <- c(gamma.hat, gamma.hat1, gamma.hat2, gamma.hat3, gamma.hat4, gamma.hat5)
# univ_red is the best, indicating that this model will validate
# on new data about 1% worse than on this dataset (it is simpler than the full model, hence less overfitting).
m <- matrix(gamma.t, nrow = 1, byrow = TRUE,
dimnames = list("Shrink", col))
m
## full hand_bw bw forward univ univ_red
## Shrink 0.980773 0.9875058 0.9887574 0.9847451 0.9861578 0.9890342
#### AIC
AIC.models <- c(AIC(fit.multi.lrm), AIC(fit.multi.lrm_res), AIC(fit.bw1), AIC(fit.for), AIC(fit.univ), AIC(fit.univ2))
mAIC <- matrix(AIC.models, nrow = 1, byrow = TRUE,
dimnames = list("AIC", col))
mAIC # forward is the best
## full hand_bw bw forward univ univ_red
## AIC 4439.877 4451.426 4438.213 4437.847 4474.623 4484.745
#### AUC
area <- c(lroc0$auc, plot.newlroc22$auc, plot.newlroc222$auc, plot.newlroc23$auc, plot.newlroc5$auc, plot.newlroc6$auc)
mAuc <- matrix(area, nrow = 1, byrow = TRUE,
dimnames = list("AUC", col))
mAuc
## full hand_bw bw forward univ univ_red
## AUC 0.8580496 0.8546574 0.8558896 0.8570258 0.8539021 0.8523997
Considering all the results returned by the different statistical procedures, we can conclude that all the models are very close in terms of discrimination and calibration. The best trade-off between these two measures is given by fit.bw1, the model obtained by the fastbw() algorithm after removing the non-linear effect of Poverty.index.sqrt (as suggested by the anova and by the p-values of the coefficients).
fit.bw1
## Logistic Regression Model
##
## lrm(formula = death ~ Age + Poverty.index.sqrt + rcs(Red.blood.cells.log,
## 3) + Sedimentation.rate.log + Serum.Albumin + rcs(Serum.Magnesium,
## 3) + Systolic.BP + Serum.Protein + Sex + rcs(BMI.log, 3) +
## rcs(Serum.Iron.sqrt, 6) + White.blood.cells.log, data = survey_mice,
## x = TRUE, y = TRUE)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 6989 LR chi2 1690.00 R2 0.369 C 0.856
## 0 5888 d.f. 19 g 1.980 Dxy 0.712
## 1 1101 Pr(> chi2) <0.0001 gr 7.240 gamma 0.712
## max |deriv| 1e-08 gp 0.189 tau-a 0.189
## Brier 0.098
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 5.4151 2.1170 2.56 0.0105
## Age 0.0862 0.0038 22.58 <0.0001
## Poverty.index.sqrt -0.0295 0.0065 -4.55 <0.0001
## Red.blood.cells.log -2.8444 0.7427 -3.83 0.0001
## Red.blood.cells.log' 3.8544 0.8660 4.45 <0.0001
## Sedimentation.rate.log 0.2081 0.0553 3.76 0.0002
## Serum.Albumin -0.5789 0.1418 -4.08 <0.0001
## Serum.Magnesium -1.6249 0.5016 -3.24 0.0012
## Serum.Magnesium' 1.0067 0.5631 1.79 0.0738
## Systolic.BP 0.0077 0.0016 4.70 <0.0001
## Serum.Protein 0.2456 0.0878 2.80 0.0052
## Sex=female -1.0500 0.0864 -12.15 <0.0001
## BMI.log -2.4128 0.4862 -4.96 <0.0001
## BMI.log' 2.2768 0.5638 4.04 <0.0001
## Serum.Iron.sqrt 0.1596 0.1261 1.27 0.2054
## Serum.Iron.sqrt' -3.2556 1.0373 -3.14 0.0017
## Serum.Iron.sqrt'' 23.0333 6.9910 3.29 0.0010
## Serum.Iron.sqrt''' -44.1589 14.4662 -3.05 0.0023
## Serum.Iron.sqrt'''' 34.4816 13.2855 2.60 0.0094
## White.blood.cells.log 0.7798 0.1460 5.34 <0.0001
##
confint.default(fit.bw1)
## 2.5 % 97.5 %
## Intercept 1.265788898 9.56444370
## Age 0.078750845 0.09372112
## Poverty.index.sqrt -0.042235879 -0.01679788
## Red.blood.cells.log -4.299935352 -1.38879861
## Red.blood.cells.log' 2.156936446 5.55176780
## Sedimentation.rate.log 0.099736610 0.31652637
## Serum.Albumin -0.856930456 -0.30092611
## Serum.Magnesium -2.607915772 -0.64187393
## Serum.Magnesium' -0.096935532 2.11041824
## Systolic.BP 0.004483579 0.01089159
## Serum.Protein 0.073505846 0.41768159
## Sex=female -1.219442236 -0.88058923
## BMI.log -3.365788052 -1.45975345
## BMI.log' 1.171728220 3.38180726
## Serum.Iron.sqrt -0.087428400 0.40672010
## Serum.Iron.sqrt' -5.288672971 -1.22248731
## Serum.Iron.sqrt'' 9.331134522 36.73551253
## Serum.Iron.sqrt''' -72.512171187 -15.80554885
## Serum.Iron.sqrt'''' 8.442552510 60.52072836
## White.blood.cells.log 0.493738111 1.06585391
We have to keep in mind that the coefficients \(\beta\) in logistic regression represent the change in the log odds when the predictor increases by one unit. To interpret the results it is easier to reason on the odds-ratio scale. The odds ratio (OR) is a measure of association between the predictor and the outcome:
- OR > 1 indicates an increase in the probability of the outcome
- OR = 1 indicates no effect of the predictor on the probability of the outcome
- OR < 1 means that the predictor has a protective effect, i.e. it decreases the probability of the outcome.
If the predictor is continuous and we exponentiate its coefficient estimated by the model, we obtain the estimated OR for an increase of one unit of the predictor keeping all the other covariates constant, while from summary(model) we obtain the effect over an interquartile range. If the predictor is binary or categorical, we obtain the estimated OR of a level with respect to the chosen baseline level.
#Age
exp(fit.bw1$coefficients)[2]
## Age
## 1.090064
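exp(coef) gives the OR for a one-unit increase, while the interquartile-range effect printed by summary() is just the coefficient multiplied by the IQR width before exponentiating. Checking this for Age (Python; coefficient 0.0862 and quartiles 35 and 66 from the output above):

```python
import math

# Per-unit odds ratio vs interquartile-range odds ratio for Age
# (coefficient 0.0862 from the model output, quartiles 35 and 66).
beta_age = 0.0862
or_per_unit = math.exp(beta_age)         # ~1.09 per additional year of age
or_iqr = math.exp(beta_age * (66 - 35))  # ~14.5 over the 31-year IQR
print(round(or_per_unit, 2), round(or_iqr, 1))
```

This matches the Age odds ratio of about 14.5 reported in the summary() effects table (small differences come from coefficient rounding).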
### Effect on odds ratio scale
# Interquartile-range odds ratios for continuous predictors and simple odds
# ratios for categorical predictors. Numbers at left are upper quartile vs lower quartile or
# current group vs reference group. The bars represent confidence limits.
# The intervals are drawn on the log odds ratio scale and labeled on the odds ratio
# scale. Ranges are on the original scale.
summary(fit.bw1)
## Effects Response : death
##
## Factor Low High Diff. Effect S.E.
## Age 35.0000 66.0000 31.00000 2.673300 0.118390
## Odds Ratio 35.0000 66.0000 31.00000 14.488000 NA
## Poverty.index.sqrt 11.7050 19.3130 7.60850 -0.224580 0.049375
## Odds Ratio 11.7050 19.3130 7.60850 0.798850 NA
## Red.blood.cells.log 1.4816 1.6194 0.13778 0.010799 0.058147
## Odds Ratio 1.4816 1.6194 0.13778 1.010900 NA
## Sedimentation.rate.log 1.9459 3.0910 1.14510 0.238340 0.063331
## Odds Ratio 1.9459 3.0910 1.14510 1.269100 NA
## Serum.Albumin 4.2000 4.6000 0.40000 -0.231570 0.056736
## Odds Ratio 4.2000 4.6000 0.40000 0.793290 NA
## Serum.Magnesium 1.5900 1.7700 0.18000 -0.156570 0.045607
## Odds Ratio 1.5900 1.7700 0.18000 0.855070 NA
## Systolic.BP 118.0000 150.0000 32.00000 0.246000 0.052311
## Odds Ratio 118.0000 150.0000 32.00000 1.278900 NA
## Serum.Protein 6.8000 7.4000 0.60000 0.147360 0.052681
## Odds Ratio 6.8000 7.4000 0.60000 1.158800 NA
## BMI.log 3.0974 3.3433 0.24586 -0.221970 0.055039
## Odds Ratio 3.0974 3.3433 0.24586 0.800940 NA
## Serum.Iron.sqrt 8.6603 11.0000 2.33970 -0.180050 0.102540
## Odds Ratio 8.6603 11.0000 2.33970 0.835220 NA
## White.blood.cells.log 1.7918 2.1518 0.36000 0.280730 0.052543
## Odds Ratio 1.7918 2.1518 0.36000 1.324100 NA
## Sex - male:female 2.0000 1.0000 NA 1.050000 0.086444
## Odds Ratio 2.0000 1.0000 NA 2.857700 NA
## Lower 0.95 Upper 0.95
## 2.441300 2.905400
## 11.488000 18.272000
## -0.321350 -0.127810
## 0.725170 0.880020
## -0.103170 0.124770
## 0.901980 1.132900
## 0.114210 0.362460
## 1.121000 1.436900
## -0.342770 -0.120370
## 0.709800 0.886590
## -0.245960 -0.067182
## 0.781950 0.935030
## 0.143470 0.348530
## 1.154300 1.417000
## 0.044104 0.250610
## 1.045100 1.284800
## -0.329850 -0.114100
## 0.719030 0.892170
## -0.381030 0.020924
## 0.683160 1.021100
## 0.177750 0.383710
## 1.194500 1.467700
## 0.880590 1.219400
## 2.412300 3.385300
plot(summary(fit.bw1), log=TRUE)
From this model we observe a protective effect on the risk of death for Poverty.index.sqrt, Serum.Albumin, Serum.Magnesium, BMI.log and Serum.Iron.sqrt. For Poverty.index.sqrt, for example, this means that being richer decreases the risk of death.
We have to be aware that for transformed predictors we must back-transform the values: apply exp() for log-transformed variables or square the value for sqrt-transformed ones. For example, patients with a Poverty index of \(19.313^2 \approx 373\) have about 20% lower odds of death than patients with an index of \(11.705^2 \approx 137\).
On the contrary, the probability of death is increased by the predictors Age, Sedimentation.rate.log, Systolic.BP, Serum.Protein and White.blood.cells.log: an increase in these predictors increases the risk of death. Furthermore, male subjects seem to have about 2.86 times the odds of death of female ones.
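The back-transformation for Poverty.index.sqrt can be checked numerically (Python; the coefficient −0.0295 and quartiles 11.705 and 19.313 come from the model output above):

```python
import math

# OR for moving from the lower to the upper quartile of Poverty.index.sqrt,
# plus the corresponding values on the original (squared) Poverty index scale.
beta = -0.0295
low, high = 11.705, 19.313
odds_ratio = math.exp(beta * (high - low))        # ~0.80, i.e. ~20% lower odds
original_scale = (round(low**2), round(high**2))  # (137, 373)
print(round(odds_ratio, 2), original_scale)
```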
Question 6: Present the results
Represent for the clinical audience the results from the estimated model
survey_mice$Sexf <- as.factor(survey_mice$Sex)
levels(survey_mice$Sexf) <- c("male","female")
label(survey_mice$Sexf) <- "Sex"
options(datadist='dd')
dd <-datadist(survey_mice)
formula(fit.bw1)
## death ~ Age + Poverty.index.sqrt + rcs(Red.blood.cells.log, 3) +
## Sedimentation.rate.log + Serum.Albumin + rcs(Serum.Magnesium,
## 3) + Systolic.BP + Serum.Protein + Sex + rcs(BMI.log, 3) +
## rcs(Serum.Iron.sqrt, 6) + White.blood.cells.log
fredfin <- lrm(death ~ Age + Poverty.index.sqrt + rcs(Red.blood.cells.log, 3) + Sedimentation.rate.log +
Serum.Albumin + rcs(Serum.Magnesium, 3) + Systolic.BP + Serum.Protein +
Sex + rcs(BMI.log, 3) + rcs(Serum.Iron.sqrt, 6) + White.blood.cells.log ,data = survey_mice)
fredfin
## Logistic Regression Model
##
## lrm(formula = death ~ Age + Poverty.index.sqrt + rcs(Red.blood.cells.log,
## 3) + Sedimentation.rate.log + Serum.Albumin + rcs(Serum.Magnesium,
## 3) + Systolic.BP + Serum.Protein + Sex + rcs(BMI.log, 3) +
## rcs(Serum.Iron.sqrt, 6) + White.blood.cells.log, data = survey_mice)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 6989 LR chi2 1690.00 R2 0.369 C 0.856
## 0 5888 d.f. 19 g 1.980 Dxy 0.712
## 1 1101 Pr(> chi2) <0.0001 gr 7.240 gamma 0.712
## max |deriv| 1e-08 gp 0.189 tau-a 0.189
## Brier 0.098
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 5.4151 2.1170 2.56 0.0105
## Age 0.0862 0.0038 22.58 <0.0001
## Poverty.index.sqrt -0.0295 0.0065 -4.55 <0.0001
## Red.blood.cells.log -2.8444 0.7427 -3.83 0.0001
## Red.blood.cells.log' 3.8544 0.8660 4.45 <0.0001
## Sedimentation.rate.log 0.2081 0.0553 3.76 0.0002
## Serum.Albumin -0.5789 0.1418 -4.08 <0.0001
## Serum.Magnesium -1.6249 0.5016 -3.24 0.0012
## Serum.Magnesium' 1.0067 0.5631 1.79 0.0738
## Systolic.BP 0.0077 0.0016 4.70 <0.0001
## Serum.Protein 0.2456 0.0878 2.80 0.0052
## Sex=female -1.0500 0.0864 -12.15 <0.0001
## BMI.log -2.4128 0.4862 -4.96 <0.0001
## BMI.log' 2.2768 0.5638 4.04 <0.0001
## Serum.Iron.sqrt 0.1596 0.1261 1.27 0.2054
## Serum.Iron.sqrt' -3.2556 1.0373 -3.14 0.0017
## Serum.Iron.sqrt'' 23.0333 6.9910 3.29 0.0010
## Serum.Iron.sqrt''' -44.1589 14.4662 -3.05 0.0023
## Serum.Iron.sqrt'''' 34.4816 13.2855 2.60 0.0094
## White.blood.cells.log 0.7798 0.1460 5.34 <0.0001
##
nom <- nomogram (fredfin,
fun=plogis , funlabel ="Probability ",
lp=TRUE,
fun.at =c(.01,.1,.25,.5,.75,.95))
plot(
nom,
cex.axis = 0.75 ,
cex.var = 0.75 ,
#col.grid = gray(c(0.8, 0.95))
)
nom
## Points per unit of linear predictor: 23.19217
## Linear predictor units per point : 0.04311799
##
##
## Age Points
## 25 0
## 30 10
## 35 20
## 40 30
## 45 40
## 50 50
## 55 60
## 60 70
## 65 80
## 70 90
## 75 100
##
##
## Poverty.index.sqrt Points
## 0 24
## 5 21
## 10 17
## 15 14
## 20 10
## 25 7
## 30 3
## 35 0
##
##
## Red.blood.cells.log Points
## 0.9 39
## 1.0 32
## 1.1 26
## 1.2 19
## 1.3 12
## 1.4 6
## 1.5 0
## 1.6 0
## 1.7 6
## 1.8 13
## 1.9 20
## 2.0 27
## 2.1 34
## 2.2 40
##
##
## Sedimentation.rate.log Points
## 0.0 0
## 0.5 2
## 1.0 5
## 1.5 7
## 2.0 10
## 2.5 12
## 3.0 14
## 3.5 17
## 4.0 19
## 4.5 22
##
##
## Serum.Albumin Points
## 2.5 47
## 3.0 40
## 3.5 34
## 4.0 27
## 4.5 20
## 5.0 13
## 5.5 7
## 6.0 0
##
##
## Serum.Magnesium Points
## 0.8 36
## 1.0 29
## 1.2 21
## 1.4 14
## 1.6 6
## 1.8 3
## 2.0 2
## 2.2 2
## 2.4 1
## 2.6 1
## 2.8 0
##
##
## Systolic.BP Points
## 80 0
## 100 4
## 120 7
## 140 11
## 160 14
## 180 18
## 200 21
## 220 25
## 240 29
## 260 32
##
##
## Serum.Protein Points
## 4 0
## 5 6
## 6 11
## 7 17
## 8 23
## 9 28
## 10 34
## 11 40
## 12 46
##
##
## Sex Points
## male 24
## female 0
##
##
## BMI.log Points
## 2.4 43
## 2.6 32
## 2.8 21
## 3.0 10
## 3.2 0
## 3.4 0
## 3.6 3
## 3.8 7
## 4.0 10
## 4.2 14
## 4.4 17
##
##
## Serum.Iron.sqrt Points
## 4 0
## 6 7
## 8 13
## 10 4
## 12 3
## 14 8
## 16 14
## 18 20
## 20 25
##
##
## White.blood.cells.log Points
## 0.5 0
## 1.0 9
## 1.5 18
## 2.0 27
## 2.5 36
## 3.0 45
## 3.5 54
## 4.0 63
## 4.5 72
##
##
## Total Points Probability
## 128 0.01
## 184 0.10
## 209 0.25
## 235 0.50
## 260 0.75
## 303 0.95
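The final table maps total points to a probability through the linear predictor: lp = points × (lp units per point) + offset, then p = plogis(lp) = 1/(1 + e^(−lp)). A hedged Python sketch (the 235-point anchor is inferred from the printed pair "235 points ↔ probability 0.50", i.e. lp = 0 there; this is an illustration, not the exact rms internals):

```python
import math

# Sketch: convert nomogram total points to a probability.
# UNITS_PER_POINT is printed by nom; ANCHOR_POINTS is read off the final
# table (235 total points correspond to probability 0.50, i.e. lp = 0).
UNITS_PER_POINT = 0.04311799
ANCHOR_POINTS = 235

def points_to_prob(points):
    lp = (points - ANCHOR_POINTS) * UNITS_PER_POINT
    return 1 / (1 + math.exp(-lp))   # plogis

print(round(points_to_prob(260), 2))  # ~0.75, matching the table
print(round(points_to_prob(303), 2))  # ~0.95
```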
Question 7: Machine Learning model XGBoost
Why? XGBoost is an algorithm that has recently been dominating applied machine-learning competitions, including in biomedical research. It has a wide range of applications: it can be used to solve regression, classification, ranking and user-defined prediction problems. XGBoost stands for eXtreme Gradient Boosting and is a decision-tree-based ensemble machine-learning algorithm that uses a gradient-boosting framework. In gradient boosting, new models are trained to predict the residuals (i.e. the errors) of prior models: at each iteration a new tree is built to predict the residual errors of the current ensemble, and its (shrunken) predictions are added to those of the previous trees (boosting). When adding new models, the loss is minimized using a gradient-descent algorithm.
survey_mice$death <- as.factor(survey_mice$death)
#split data in train and test set
# splitting assuring that outcome distribution is equal in both train and test
trainIndex <- createDataPartition(survey_mice$death, p = .8,
list = FALSE,
times = 1)
surveyTrain <- survey_mice[ trainIndex,]
surveyTest <- survey_mice[-trainIndex,]
options(na.action='na.pass')
#implement one hot encoding: all the variables must be numeric
new.tr <- model.matrix(~., data = surveyTrain %>% dplyr::select(-c(death, time_death, status)))
options(na.action='na.pass')
new.ts <- model.matrix(~., data = surveyTest %>% dplyr::select(-c(death, time_death, status)))
#convert factor response to numeric
train.labels = as.numeric(surveyTrain$death)-1
test.labels = as.numeric(surveyTest$death)-1
dtrain <- xgb.DMatrix(data = new.tr, label = train.labels)
dtest <- xgb.DMatrix(data = new.ts, label = test.labels)
#tuning parameter
# gbtree = ensemble of decision trees
params <- list(booster = "gbtree", objective = "binary:logistic", eta=0.3, gamma=0, max_depth=6, min_child_weight=1, subsample=1, colsample_bytree=1)
# NB: with metrics = "auc", maximize should arguably be TRUE; with maximize = F early stopping
# treats the rising AUC as no improvement, which explains the "Best iteration: [1]" below.
xgbcv <- xgb.cv( params = params, data = dtrain, nrounds = 100, nfold = 5, showsd = T,
stratified = T, print_every_n = 10, early_stopping_rounds = 20, maximize = F, metrics = "auc", prediction = T)
## [1] train-auc:0.863656+0.004876 test-auc:0.809192+0.012647
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 20 rounds.
##
## [11] train-auc:0.954022+0.002061 test-auc:0.832408+0.011991
## [21] train-auc:0.977327+0.002507 test-auc:0.835268+0.010011
## Stopping. Best iteration:
## [1] train-auc:0.863656+0.004876 test-auc:0.809192+0.012647
xgbcv_er <- xgb.cv( params = params, data = dtrain, nrounds = 100, nfold = 5, showsd = T,
stratified = T, print_every_n = 10, early_stopping_rounds = 20, maximize = F, metrics = "error", prediction = T)
## [1] train-error:0.121692+0.002680 test-error:0.166483+0.011247
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [11] train-error:0.079399+0.002006 test-error:0.146814+0.007907
## [21] train-error:0.056912+0.002411 test-error:0.148244+0.010093
## [31] train-error:0.040549+0.002240 test-error:0.147528+0.012315
## Stopping. Best iteration:
## [12] train-error:0.077074+0.002970 test-error:0.145384+0.007333
xgbcv_er## ##### xgb.cv 5-folds
## iter train_error_mean train_error_std test_error_mean test_error_std
## 1 0.1216916 0.0026802775 0.1664832 0.011247160
## 2 0.1102466 0.0014171098 0.1571850 0.009041203
## 3 0.1045686 0.0041199426 0.1552196 0.005343685
## 4 0.0997850 0.0028915333 0.1511062 0.006611158
## 5 0.0944650 0.0026314315 0.1512854 0.006275188
## 6 0.0918722 0.0023509253 0.1520002 0.008041120
## 7 0.0895476 0.0020719889 0.1507476 0.008527296
## 8 0.0867758 0.0008652018 0.1482436 0.009338382
## 9 0.0848534 0.0009458314 0.1489582 0.010910121
## 10 0.0823498 0.0019451724 0.1482446 0.007013073
## 11 0.0793992 0.0020064718 0.1468138 0.007906507
## 12 0.0770744 0.0029704292 0.1453836 0.007332844
## 13 0.0756886 0.0036086057 0.1468138 0.008772210
## 14 0.0730058 0.0033395608 0.1468140 0.009706942
## 15 0.0700558 0.0036734958 0.1471718 0.008832453
## 16 0.0678204 0.0027948312 0.1457406 0.010038517
## 17 0.0649144 0.0026219399 0.1464562 0.009232826
## 18 0.0625002 0.0018677320 0.1469922 0.011012184
## 19 0.0604884 0.0021593235 0.1477072 0.012866228
## 20 0.0591024 0.0016807219 0.1484228 0.011856811
## 21 0.0569120 0.0024113226 0.1482444 0.010092984
## 22 0.0544978 0.0029671626 0.1469924 0.012163061
## 23 0.0523518 0.0025344273 0.1478862 0.011497970
## 24 0.0491326 0.0022407130 0.1487802 0.011296364
## 25 0.0483276 0.0023441126 0.1482432 0.011909274
## 26 0.0468080 0.0029321842 0.1484220 0.012893784
## 27 0.0451984 0.0031890937 0.1478852 0.013350000
## 28 0.0437680 0.0022796510 0.1503892 0.011952926
## 29 0.0418456 0.0015942728 0.1487796 0.012279878
## 30 0.0414428 0.0025213293 0.1482434 0.012279533
## 31 0.0405488 0.0022401940 0.1475280 0.012314519
## 32 0.0396994 0.0020151109 0.1473492 0.012616106
## iter train_error_mean train_error_std test_error_mean test_error_std
## Best iteration:
## iter train_error_mean train_error_std test_error_mean test_error_std
## 12 0.0770744 0.002970429 0.1453836 0.007332844
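The runs above use the xgboost defaults (eta = 0.3, max_depth = 6). A small grid search could be used to tune these two parameters by cross-validated test error; the sketch below is an assumption about how one might do it (the grid values are illustrative, not values used in this analysis), reusing the `params`, `dtrain` and `xgb.cv` setup from the chunks above. It can take a few minutes to run.

```r
# Hedged sketch: grid search over eta and max_depth, scoring each combination
# by the best cross-validated test error. Grid values are assumptions.
grid <- expand.grid(eta = c(0.05, 0.1, 0.3), max_depth = c(3, 6, 9))
grid$test_error <- apply(grid, 1, function(g) {
  p <- modifyList(params, list(eta = g[["eta"]], max_depth = g[["max_depth"]]))
  cv <- xgb.cv(params = p, data = dtrain, nrounds = 100, nfold = 5,
               early_stopping_rounds = 20, metrics = "error", verbose = 0)
  min(cv$evaluation_log$test_error_mean)
})
grid[which.min(grid$test_error), ]  # best (eta, max_depth) pair found
```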
# model training
# note: early stopping monitors the last watchlist entry (train here), whose
# error decreases monotonically, so it never triggers; with nrounds = 79 the
# model overfits (val-error rises in the log below while train-error falls)
xgb1 <- xgb.train (params = params, data = dtrain, nrounds = 79, watchlist = list(val=dtest,train=dtrain),
print_every_n = 10, early_stopping_round = 10, maximize = F , eval_metric = "error")## [1] val-error:0.153185 train-error:0.126252
## Multiple eval metrics are present. Will use train_error for early stopping.
## Will train until train_error hasn't improved in 10 rounds.
##
## [11] val-error:0.141732 train-error:0.085837
## [21] val-error:0.150322 train-error:0.067597
## [31] val-error:0.156764 train-error:0.044170
## [41] val-error:0.152470 train-error:0.030937
## [51] val-error:0.158196 train-error:0.017525
## [61] val-error:0.161775 train-error:0.010193
## [71] val-error:0.156049 train-error:0.006080
## [79] val-error:0.152470 train-error:0.003398
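Rather than the fixed `nrounds = 79`, the number of boosting rounds could be taken from the cross-validation above, which identified round 12 as the best iteration by test error. A minimal sketch, assuming `xgbcv_er`, `params` and `dtrain` from the chunks above:

```r
# Hedged sketch: retrain with the CV-selected number of rounds to limit the
# overfitting visible in the training log above
best_n <- xgbcv_er$best_iteration  # 12 in the run shown above
xgb_cv_rounds <- xgb.train(params = params, data = dtrain, nrounds = best_n)
```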
#model prediction
xgbpred <- predict (xgb1,dtest)
xgbpred <- ifelse (xgbpred > 0.5,1,0)
xgbpredf <- as.factor(xgbpred)
# bug fix: compare predictions against the true test labels, not against the
# predictions themselves (the original as.factor(xgbpred) here makes the
# confusion matrix below trivially perfect)
test.labelsf <- as.factor(test.labels)
confusionMatrix(xgbpredf, test.labelsf)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1278 0
## 1 0 119
##
## Accuracy : 1
## 95% CI : (0.9974, 1)
## No Information Rate : 0.9148
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.9148
## Detection Rate : 0.9148
## Detection Prevalence : 0.9148
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
#Importance of variables
mat <- xgb.importance (feature_names = colnames(new.tr),model = xgb1)
xgb.ggplot.importance (importance_matrix = mat[1:19])
The apparently perfect confusion matrix above is an artifact: the predictions were compared against themselves rather than against the true test labels. Judged against the held-out labels, the model misclassifies about 15% of patients (the val-error in the training log), and under 5-fold cross-validation its performance is comparable to the Logistic Regression model implemented earlier, with an AUC of roughly 80%.
The list of most important variables partially overlaps with those used in the Logistic model; both identify Age as the main contributor.
We should also keep in mind that ML algorithms work best on large and complex datasets, which can be a limitation when dealing with (often smaller) medical data.
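The cross-validated AUC quoted above can be computed directly from the out-of-fold predictions stored by `xgb.cv` (available because `prediction = T` was set). A minimal sketch, assuming the pROC package is available; `xgbcv` and `train.labels` come from the chunks above:

```r
# Hedged sketch: cross-validated AUC from xgb.cv's out-of-fold predictions,
# directly comparable to the logistic model's AUC
library(pROC)
cv_roc <- roc(response = train.labels, predictor = xgbcv$pred)
auc(cv_roc)
```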
Comments and Further Extensions
The log and sqrt transformations applied to some variables, to make them more normally distributed, lead to smaller standard errors.
The Logistic model performs quite well, with an AUC > 80%.
We could stratify the data by Sex to see whether there are differences between groups.
We have highly correlated variables; this does not harm the goodness of fit so much as the coefficient estimates, whose standard errors it inflates. Dropping the most correlated variables is the easiest solution, but not necessarily the most effective. There could also be important interactions, and thus potential confounding variables, not considered in this model; a clinician's opinion would help identify these interactions.
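As a first step toward handling the correlated predictors, one could flag the most redundant numeric variables automatically. A minimal sketch, assuming the caret package and that `survey_mice` retains the `numeric_vars` columns defined earlier; the 0.8 cutoff is an assumption, not a value used in this analysis:

```r
# Hedged sketch: flag highly correlated numeric predictors as candidates to
# drop or to discuss with a clinician (time_death should be excluded first
# if only candidate predictors are of interest)
library(caret)
cor_mat <- cor(survey_mice[, numeric_vars], use = "pairwise.complete.obs")
high_cor <- findCorrelation(cor_mat, cutoff = 0.8, names = TRUE)
high_cor
```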